Having issues with shell script concept - bash

Hi, I have a file with following contents
> 1234 alphabet /vag/one/arun
> 1454 bigdata /home/two/ogra
> 5684 apple /vinay/three/dire
but i want the output to be like
> 1234 alphabet one
> 1454 bigdata two
> 5684 apple three

awk '{
split($NF,ar,"/");
$NF=ar[3]
for (i=1;i<=NF;i++) {
printf "%s ",$i
}
printf "\n"
}' filename
Take the last field delimited by space and split it into the array ar based on "/", the last field equal to the third element in ar and then loop through the fields printing them.

Cut each input line into pieces, throw the parts you don't need into a wastebasket, and reassemble what is left. For instance:
grep -Eo '(^|/[^/]+/|/[^/]+$|[^/]+)' <INPUTFILE| grep -Fv /|xargs -L 2 -d '\n' echo >OUTPUTFILE

You can do it simply by controlling IFS (the Internal Field Separator) which when it includes the '/' character will cause filed splitting to occur allowing you to read a/b/c into separate variables. Then it's just a matter of printing the variables you want, e.g. with your original contents in file,
$ while IFS="${IFS}/" read -r n l a b c; do echo "$n $l $b"; done < file
1234 alphabet one
1454 bigdata two
5684 apple three

Related

How to get from a file only the character with reputed value

I need to extract from the file the words that contain certain letters in a certain amount.
I apologize if this question has been resolved in the past, I just did not find anything that fits what I am looking for.
File:
wab 12aaabbb abababx ab ttttt baaabb zabcabc
baab baaabb cbaab ab ccabab zzz
For example
1. If I chose the letters a and the number is 1 the output should be:
wab
ab
ab
//only the words that contains a and the char appear in the word 1 time
2. If I chose the letters a,b and the number is 3, the output should be:
12aaabbb
abababx
baaabb
//only the word contains a,b, and both chars appear in the word 3 times
3. If I chose the letters a,b,c and the number 2, the output should be:
ccabab
zabcabc
//only the words that contains a,b,c and the chars appear in the word 3 times
Is it possible to find 2 letters in the same script?
I was able to find in a single letter but I get only the words where the letters appear in sequence and I do not want to find only these words, that's what I did:
egrep '([a])\1{N-1}' file
And another problem I can not get only the specific words, I get all file and the letter I am looking for "a" in red.
I tried using -w but it does not display anything.
::: EDIT :::
try to edit what you did to a for
i=$1
fileName=$2
letters=${#: 3}
tr -s '[:space:]' '\n' < $fileName* |
for letter in $letters; do
grep -E "^[^$letter]*($letter[^$letter]*){$i}$"
done | uniq
There are various ways to split input so that grep sees a single word per line. tr is most common. For example:
tr -s '[:space:]' '\n' file | ...
We can build a function to find a specific number of a particular letter:
NofL(){
num=$1
letter=$2
regex="^[^$letter]*($letter[^$letter]*){$num}$"
grep -E "$regex"
}
Then:
# letter=a number=1
tr -s '[:space:]' '\n' file | NofL 1 a
# letters=a,b number=3
tr -s '[:space:]' '\n' file | NofL 3 a | NofL 3 b
# letters=a,b,c number=2
tr -s '[:space:]' '\n' file | NofL 2 a | NofL 2 b | NofL 2 c
Regexes are not really suited for that job as there are more efficient ways, but it is possible using repeated matching. We first select all words, from those we select words with n as, and from those we select words with n bs and so on.
Example for n=3 and a, b:
grep -Eo '[[:alnum:]]+' |
grep -Ex '[^a]*a[^a]*a[^a]*a[^a]*' |
grep -Ex '[^b]*b[^b]*b[^b]*b[^b]*'
To auto-generate such a command from an input like 3 a b, you need to dynamically create a pipeline, which is possible, but also a hassle:
exactly_n_times_char() {
(( $# >= 2 )) || { cat; return; }
local n="$1" char="$2" regex
regex="[^$char]*($char[^$char]*){$n}"
shift 2
grep -Ex "$regex" | exactly_n_times_char "$n" "$#"
}
grep -Eo '[[:alnum:]]+' file.txt | exactly_n_times_char 3 a b
With PCREs (requires GNU grep or pcregrep) the check can be done in a single regex:
exactly_n_times_char() {
local n="$1" regex=""
shift
for char; do # could be done without a loop using sed on $*
regex+="(?=[^$char\\W]*($char[^$char\\W]*){$n})"
done
regex+='\w+'
grep -Pow "$regex"
}
exactly_n_times_char 3 a b < file.txt
If a matching word appears multiple times (like baaabb in your example) it is printed multiple times too. You can filter out duplicates by piping through sort -u but that will change the order.
A method using sed and bash would be:
#!/bin/bash
file=$1
n=$2
chars=$3
for ((i = 0; i < ${#chars}; ++i)); do
c=${chars:i:1}
args+=(-e)
args+=("/^\([^$c]*[$c]\)\{$n\}[^$c]*\$/!d")
done
sed "${args[#]}" <(tr -s '[:blank:]' '\n' < "$file")
Notice that filename, count, and characters are parameterized. Use it as
./script filename 2 abc
which should print out
zabcabc
ccabab
given the file content in the question.
An implementation in pure bash, without calling an external program, could be:
#!/bin/bash
readonly file=$1
readonly n=$2
readonly chars=$3
while read -ra words; do
for word in "${words[#]}"; do
for ((i = 0; i < ${#chars}; ++i)); do
c=${word//[^${chars:i:1}]}
(( ${#c} == n )) || continue 2
done
printf '%s\n' "$word"
done
done < "$file"
You can match a string containing exactly N occurrences of character X with the (POSIX-extended) regexp [^X]*(X[^X]*){N}. To do this for multiple characters you could chain them, and the traditional way to process one 'word' at a time, simplistically defined as a sequence of non-whitespace chars, is like this
<infile tr -s ' \t\n' ' ' | grep -Ex '[^a]*(a[^a]*){3}' | \grep -Ex '[^b]*(b[^b]*){3}'
# may need to add \r on Windows-ish systems or for Windows-derived data
If you get colorized output from egrep and grep and maybe some other utilities it's usually because in a GNU-ish environment you -- often via a profile that was automatically provided and you didn't look at or modify -- set aliases to turn them into e.g. egrep --color=auto or possibly/rarely =always; using \grep or command grep or the pathname such as /usr/bin/grep disables the alias, or you could just un-set it/them. Another possibility is you may have envvar(s) set in which case you need to remove or suppress it/them, or explicitly say --color=never, or (somewhat hackily) pipe the output through ... | cat which has the effect of making [e]grep's stdout a pipe not a tty and thus turning off =auto.
However, GNU awk (not necessarily others) can also do this more directly:
<infile awk -vRS='[ \t\n]+' -F '' '{delete f;for(i=1;i<=NF;i++)f[$i]++}
f["a"]==3&&f["b"]==3'
or to parameterize the criteria:
<infile awk -vRS='[ \t\n]+' -F '' 'BEGIN{split("ab",w,//);n=3}
{delete f;for(i=1;i<=NF;i++)f[$i]++;s=1;for(t in w)if(f[w[t]]!=occur)s=0} s'
perl can do pretty much everything awk can do, and so can some other general-purpose tools, but I leave those as exercises.

Grep list (file) from another file

Im new to bash and trying to extract a list of patterns from file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
Suggested output is smth like
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1
the code i tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But output is only filename and then some list of strings without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so i don`t know to what pattern belongs each string and have to check and assign manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a for loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >>output.a[j]
next }
}' File1.txt base.csv
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
grep -f "${q}" base.csv >>"output.${i}"
echo "\n"
done < "${i}"
done
Here is one that separates (with split, comma-separatd with quotes and spaces stripped off) words from file2 to an array (word[]) and stores the record names (line 1 etc.) to it comma-separated:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
Tried both variants above but kept getting various errors ( "do" expected) or misbehavior ( gets names of pattern blocks, eg ABC, BDF, but no lines.
Gave up for a while and then eventually tried another way
While base goal were to cycle through pattern list files, search for patterns in huge file and write out specific columns from lines found - i simply wrote
for *i in *txt # cycle throughfiles w/ patterns
do
grep -F -f "$i" bigfile.csv >> ${i}.out1 #greps all patterns from current file
cut -f 2,3,4,7 ${i}.out1>> ${i}.out2 # cuts columns of interest and writes them out to another file
done
I'm aware that this code should be improved using some fancy pipeline features, but it works perfectly as is, hope it`ll help somebody in similar situation. You can easily add some echoes to write out pattern list names as i initially requested

Bash command to read a line based on the parameters I pass - perform column-based lookups

I have a file links.txt:
1 a.sh
3 b.sh
6 c.sh
4 d.sh
So, if i pass 1,4 as parameters to another file(master.sh), a.sh and d.sh should be stored in a variable.
sed '3!d' would print the 3rd line, but not the line that starts with 3. For that, you need sed '/^3 /!d'. The problem is you can't combine them for more lines, as this means "Delete everything that doesn't start with a 3", which means all other lines will be missed. So, use sed -n '/^3 /p' instead, i.e. don't print by default and tell sed what lines to print, not what lines to delete.
You can loop over the argument and create a sed script from them that prints the lines, then run sed using this output:
#!/bin/bash
file=$1
shift
for id in "$#" ; do
echo "/^$id /p"
done | sed -nf- "$file"
Run as script.sh filename 3 4.
If you want to remove the id from the output, you can either use
cut -f2 -d' '
or you can modify the generated sed script to do the work
echo "/^$id /s/.* //p"
i.e. only print if the substitution was successful.
This loops through each argument and greps for it in the links file. The result is piped into cut where we specify the delimiter as a space with -d flag and the field number as 2 with -f flag. Finally this is appended to the array called files.
links="links.txt"
files=()
for arg in $#; do
files=("${files[#]}" `grep "^$arg" "$links" | cut -d" " -f2`)
done;
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Edit:
As pointed out by mklement0, the solution above reads the file once per arg. The following first builds the pattern then reads the file just once.
links="links.txt"
pattern="^$1\s"
for arg in ${#:2}; do
pattern+="|^$arg\s"
done
files=$(grep -E "$pattern" "$links" | cut -d" " -f2)
echo ${files[#]}
Usage:
$ ./master.sh 1 4
a.sh d.sh
Here is another example with grep and cut:
#!/bin/bash
for line in $(grep "$1\|$2" links.txt|cut -d' ' -f2)
do
echo $line
done
Example of usage:
./master.sh 1 4
a.sh
d.sh
Why not just stores the values and call them at will:
items=()
while read -r num file
do
items[num]="$file"
done<links.txt
for arg
do
echo "${items[arg]}"
done
Now you can use the items array any time you like :)
The following awk solution:
preserves the argument order; that is, the results reflect the order in which the lookup values were specified (as opposed to the order in which the lookup values happen to occur in the file).
If that is not important (i.e., if outputting the results in file order is acceptable), the readarray technique below can be combined with this one-liner, which is a generalized variant of Panta's answer:
grep -f <(printf "^%s\n" "$#") links.txt | cut -d' ' -f2-
performs well, because the input file is only read once; the only requirement is that all key-value pairs fit into memory as a whole (as a single associative Awk array (dictionary)).
works with any lookup values that don't have embedded whitespace.
Similarly, the assumption is that the output column values (containing values such as a.sh in the sample input) have no embedded whitespace. awk doesn't handle quoted fields well, so more work would be needed.
#!/bin/bash
readarray -t files < <(
awk -v idList="$*" '
BEGIN { count=split(idList, idArr); for (i in idArr) idDict[idArr[i]]++ }
$1 in idDict { idDict[$1] = $2 }
END { for (i=1; i<=count; ++i) print idDict[idArr[i]] }
' links.txt
)
# Print results.
printf '%s\n' "${files[#]}"
readarray -t files reads stdin input (<) line by line into array variable files.
Note: readarray requires Bash v4+; on Bash 3.x, such as on macOS, replace this part with
IFS=$'\n' read -d '' -ra files
<(...) is a Bash process substitution that, loosely speaking, presents the output from the enclosed command as if it were (self-deleting) temporary file.
This technique allows readarray to run in the current shell (as opposed to a subshell if a pipeline had been used), which is necessary for the files variable to remain defined in the remainder of the script.
The awk command breaks down as follows:
-v idList="$*" passes the space-separated list of all command-line arguments as a single string to Awk variable idList.
Note that this assumes that the arguments have no embedded spaces, which is indeed the case here and also generally the case with identifiers.
BEGIN { ... } is only executed once, before the individual lines are processed:
split(idList, idArr) splits the input ID list into an array by whitespace and stores the result in idArr.
for (i in idArr) idDict[idArr[i]]++ } then converts the (conceptually regular) array into associative array idDict (dictionary), whose keys are the input IDs - this enables efficient lookup by ID later, and also allows storing the lookup result for each ID.
$1 in idDict { idDict[$1] = $2 } is processed for every input line:
Pattern $1 in idDict returns true if the line's first whitespace-separated field ($1) - e.g., 6 - is among the keys (in) of associative array idDict, and, if so, executes the associated action ({...}).
Action { idDict[$1] = $2 } then assigns the second field ($2) - e.g., c.sh - to the iDict entry for key $1.
END { ... } is executed once, after all input lines have been processed:
for (i=1; i<=count; ++i) print idDict[idArr[i]] loops over all input IDs in order and prints each ID's lookup result, which is the value of the dictionary entry with that ID.

Want to sort a file based on another file in unix shell

I have 2 files refer.txt and parse.txt
refer.txt contains the following
julie,remo,rob,whitney,james
parse.txt contains
remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,whitney/hello/1.0,julie/hello/2.0,julie/hello/3.0,rob/hello/4.0,james/hello/6.0
Now my output.txt should list the files in parse.txt based on the order specified in refer.txt
ex of output.txt should be:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
i have tried the following code:
sort -nru refer.txt parse.txt
but no luck.
please assist me.TIA
You can do that using gnu-awk:
awk -F/ -v RS=',|\n' 'FNR==NR{a[$1] = (a[$1])? a[$1] "," $0 : $0 ; next}
{s = (s)? s "," a[$1] : a[$1]} END{print s}' parse.txt refer.txt
Output:
julie/hello/2.0,julie/hello/3.0,remo/hello/1.0,remo/hello2/2.0,remo/hello3/3.0,rob/hello/4.0,whitney/hello/1.0,james/hello/6.0
Explanation:
-F/ # Use field separator as /
-v RS=',|\n' # Use record separator as comma or newline
NR == FNR { # While processing parse.txt
a[$1]=(a[$1])?a[$1] ","$0:$0 # create an array with 1st field as key and value as all the
# records with keys julie, remo, rob etc.
}
{ # while processing the second file refer.txt
s = (s)?s "," a[$1]:a[$1] # aggregate all values by reading key from 2nd file
}
END {print s } # print all the values
In pure native bash (4.x):
# read each file into an array
IFS=, read -r -a values <parse.txt
IFS=, read -r -a ordering <refer.txt
# create a map from content before "/" to comma-separated full values in preserved order
declare -A kv=( )
for value in "${values[#]}"; do
key=${value%%/*}
if [[ ${kv[$key]} ]]; then
kv[$key]+=",$value" # already exists, comma-separate
else
kv[$key]="$value"
fi
done
# go through refer list, putting full value into "out" array for each entry
out=( )
for value in "${ordering[#]}"; do
out+=( "${kv[$value]}" )
done
# print "out" array in comma-separated form
IFS=,
printf '%s\n' "${out[*]}" >output.txt
If you're getting more output fields than you have input fields, you're probably trying to run this with bash 3.x. Since associative array support is mandatory for correct operation, this won't work.
tr , "\n" refer.txt | cat -n >person_id.txt # 'cut -n' not posix, use sed and paste
cat person_id.txt | while read person_id person_key
do
print "$person_id" > $person_key
done
tr , "\n" parse.txt | sed 's/(^[^\/]*)(\/.*)$/\1 \1\2/' >person_data.txt
cat person_data.txt | while read foreign_key person_data
do
person_id="$(<$foreign_key)"
print "$person_id" " " "$person_data" >>merge.txt
done
sort merge.txt >output.txt
A text book data processing approach, a person id table, a person data table, merged on a common key field, which is the first name of the person:
[person_key] [person_id]
- person id table, a unique sortable 'id' for each person (line number in this instance, since that is the desired sort order), and key for each person (their first name)
[person_key] [person_data]
- person data table, the data for each person indexed by 'person_key'
[person_id] [person_data]
- a merge of the 'person_id' table and 'person_data' table on 'person_key', which can then be sorted on person_id, giving the output as requested
The trick is to implement an associative array using files, the file name being the key (in this instance 'person_key'), the content being the value. [Essentially a random access file implemented using the filesystem.]
This actually adds a step to the otherwise simple but not very efficient task of grepping parse.txt with each value in refer.txt - which is more efficient I'm not sure.
NB: The above code is very unlikely to work out of the box.
NBB: On reflection, probably a better way of doing this would be to use the file system to create a random access file of parse.txt (essentially an index), and to then consider refer.txt as a batch file, submitting it as a job as such, printing out from the parse.txt random access file the data for each of the names read in from refer.txt in turn:
# 1) index data file on required field
cat person_data.txt | while read data
do
key="$(print "$data" | sed 's/(^[^\/]*)/\1/')" # alt. `cut -d'/' -f1` ??
print "$data" >>./person_data/"$key"
done
# 2) run batch job
cat refer_data.txt | while read key
do
print ./person_data/"$key"
done
However having said that, using egrep is probably just as rigorous a solution or at least for small datasets, I would most certainly use this approach given the specific question posed. (Or maybe not! The above could well prove faster as well as being more robust.)
Command
while read line; do
grep -w "^$line" <(tr , "\n" < parse.txt)
done < <(tr , "\n" < refer.txt) | paste -s -d , -
Key points
For both files, newlines are translated to commas using the tr command (without actually changing the files themselves). This is useful because while read and grep work under the assumption that your records are separated by newlines instead of commas.
while read will read in every name from refer.txt, (i.e julie, remo, etc.) and then use grep to retrieve lines from parse.txt containing that name.
The ^ in the regex ensures matching is only performed from the start of the string and not in the middle (thanks to #CharlesDuffy's comment below), and the -w option for grep allows whole-word matching only. For example, this ensures that "rob" only matches "rob/..." and not "robby/..." or "throb/...".
The paste command at the end will comma-separate the results. Removing this command will print each result on its own line.

How to extract one column of a csv file

If I have a csv file, is there a quick bash way to print out the contents of only any single column? It is safe to assume that each row has the same number of columns, but each column's content would have different length.
You could use awk for this. Change '$2' to the nth column you want.
awk -F "\"*,\"*" '{print $2}' textfile.csv
yes. cat mycsv.csv | cut -d ',' -f3 will print 3rd column.
The simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.
csvtool format '%(2)\n' input.csv
Replacing 2 with the column number will effectively extract the column data you are looking for.
Landed here looking to extract from a tab separated file. Thought I would add.
cat textfile.tsv | cut -f2 -s
Where -f2 extracts the 2, non-zero indexed column, or the second column.
Here is a csv file example with 2 columns
myTooth.csv
Date,Tooth
2017-01-25,wisdom
2017-02-19,canine
2017-02-24,canine
2017-02-28,wisdom
To get the first column, use:
cut -d, -f1 myTooth.csv
f stands for Field and d stands for delimiter
Running the above command will produce the following output.
Output
Date
2017-01-25
2017-02-19
2017-02-24
2017-02-28
To get the 2nd column only:
cut -d, -f2 myTooth.csv
And here is the output
Output
Tooth
wisdom
canine
canine
wisdom
incisor
Another use case:
Your csv input file contains 10 columns and you want columns 2 through 5 and columns 8, using comma as the separator".
cut uses -f (meaning "fields") to specify columns and -d (meaning "delimiter") to specify the separator. You need to specify the latter because some files may use spaces, tabs, or colons to separate columns.
cut -f 2-5,8 -d , myvalues.csv
cut is a command utility and here is some more examples:
SYNOPSIS
cut -b list [-n] [file ...]
cut -c list [file ...]
cut -f list [-d delim] [-s] [file ...]
I think the easiest is using csvkit:
Gets the 2nd column:
csvcut -c 2 file.csv
However, there's also csvtool, and probably a number of other csv bash tools out there:
sudo apt-get install csvtool (for Debian-based systems)
This would return a column with the first row having 'ID' in it.
csvtool namedcol ID csv_file.csv
This would return the fourth row:
csvtool col 4 csv_file.csv
If you want to drop the header row:
csvtool col 4 csv_file.csv | sed '1d'
First we'll create a basic CSV
[dumb#one pts]$ cat > file
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
Then we get the 1st column
[dumb#one pts]$ awk -F , '{print $1}' file
a
1
a
1
Many answers for this questions are great and some have even looked into the corner cases.
I would like to add a simple answer that can be of daily use... where you mostly get into those corner cases (like having escaped commas or commas in quotes etc.,).
FS (Field Separator) is the variable whose value is dafaulted to
space. So awk by default splits at space for any line.
So using BEGIN (Execute before taking input) we can set this field to anything we want...
awk 'BEGIN {FS = ","}; {print $3}'
The above code will print the 3rd column in a csv file.
The other answers work well, but since you asked for a solution using just the bash shell, you can do this:
AirBoxOmega:~ d$ cat > file #First we'll create a basic CSV
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
a,b,c,d,e,f,g,h,i,k
1,2,3,4,5,6,7,8,9,10
And then you can pull out columns (the first in this example) like so:
AirBoxOmega:~ d$ while IFS=, read -a csv_line;do echo "${csv_line[0]}";done < file
a
1
a
1
a
1
a
1
a
1
a
1
So there's a couple of things going on here:
while IFS=, - this is saying to use a comma as the IFS (Internal Field Separator), which is what the shell uses to know what separates fields (blocks of text). So saying IFS=, is like saying "a,b" is the same as "a b" would be if the IFS=" " (which is what it is by default.)
read -a csv_line; - this is saying read in each line, one at a time and create an array where each element is called "csv_line" and send that to the "do" section of our while loop
do echo "${csv_line[0]}";done < file - now we're in the "do" phase, and we're saying echo the 0th element of the array "csv_line". This action is repeated on every line of the file. The < file part is just telling the while loop where to read from. NOTE: remember, in bash, arrays are 0 indexed, so the first column is the 0th element.
So there you have it, pulling out a column from a CSV in the shell. The other solutions are probably more practical, but this one is pure bash.
You could use GNU Awk, see this article of the user guide.
As an improvement to the solution presented in the article (in June 2015), the following gawk command allows double quotes inside double quoted fields; a double quote is marked by two consecutive double quotes ("") there. Furthermore, this allows empty fields, but even this can not handle multiline fields. The following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
FPAT="([^,\"]*)|(\"((\"\")*[^\"]*)*\")"
}
{
if (substr($c, 1, 1) == "\"") {
$c = substr($c, 2, length($c) - 2) # Get the text within the two quotes
gsub("\"\"", "\"", $c) # Normalize double quotes
}
print $c
}
' c=3 < <(dos2unix <textfile.csv)
Note the use of dos2unix to convert possible DOS style line breaks (CRLF i.e. "\r\n") and UTF-16 encoding (with byte order mark) to "\n" and UTF-8 (without byte order mark), respectively. Standard CSV files use CRLF as line break, see Wikipedia.
If the input may contain multiline fields, you can use the following script. Note the use of special string for separating records in output (since the default separator newline could occur within a record). Again, the following example prints the 3rd column (via c=3) of textfile.csv:
#!/bin/bash
gawk -- '
BEGIN{
RS="\0" # Read the whole input file as one record;
# assume there is no null character in input.
FS="" # Suppose this setting eases internal splitting work.
ORS="\n####\n" # Use a special output separator to show borders of a record.
}
{
nof=patsplit($0, a, /([^,"\n]*)|("(("")*[^"]*)*")/, seps)
field=0;
for (i=1; i<=nof; i++){
field++
if (field==c) {
if (substr(a[i], 1, 1) == "\"") {
a[i] = substr(a[i], 2, length(a[i]) - 2) # Get the text within
# the two quotes.
gsub(/""/, "\"", a[i]) # Normalize double quotes.
}
print a[i]
}
if (seps[i]!=",") field=0
}
}
' c=3 < <(dos2unix <textfile.csv)
There is another approach to the problem. csvquote can output contents of a CSV file modified so that special characters within field are transformed so that usual Unix text processing tools can be used to select certain column. For example the following code outputs the third column:
csvquote textfile.csv | cut -d ',' -f 3 | csvquote -u
csvquote can be used to process arbitrary large files.
I needed proper CSV parsing, not cut / awk and prayer. I'm trying this on a mac without csvtool, but macs do come with ruby, so you can do:
echo "require 'csv'; CSV.read('new.csv').each {|data| puts data[34]}" | ruby
I wonder why none of the answers so far have mentioned csvkit.
csvkit is a suite of command-line tools for converting to and working
with CSV
csvkit documentation
I use it exclusively for csv data management and so far I have not found a problem that I could not solve using cvskit.
To extract one or more columns from a cvs file you can use the csvcut utility that is part of the toolbox. To extract the second column use this command:
csvcut -c 2 filename_in.csv > filename_out.csv
csvcut reference page
If the strings in the csv are quoted, add the quote character with the q option:
csvcut -q '"' -c 2 filename_in.csv > filename_out.csv
Install with pip install csvkit or sudo apt install csvkit.
Simple solution using awk. Instead of "colNum" put the number of column you need to print.
cat fileName.csv | awk -F ";" '{ print $colNum }'
csvtool col 2 file.csv
where 2 is the column you are interested in
you can also do
csvtool col 1,2 file.csv
to do multiple columns
You can't do it without a full CSV parser.
If you know your data will not be quoted, then any solution that splits on , will work well (I tend to reach for cut -d, -f1 | sed 1d), as will any of the CSV manipulation tools.
If you want to produce another CSV file, then xsv, csvkit, csvtool, or other CSV manipulation tools are appropriate.
If you want to extract the contents of one single column of a CSV file, unquoting them so that they can be processed by subsequent commands, this Python 1-liner does the trick for CSV files with headers:
python -c 'import csv,sys'$'\n''for row in csv.DictReader(sys.stdin): print(row["message"])'
The "message" inside of the print function selects the column.
If the CSV file doesn't have headers:
python -c 'import csv,sys'$'\n''for row in csv.reader(sys.stdin): print(row[1])'
Python's CSV library supports all kinds of CSV dialects, so if your CSV file uses different conventions, it's possible to support them with relatively little change to the code.
Been using this code for a while, it is not "quick" unless you count "cutting and pasting from stackoverflow".
It uses ${##} and ${%%} operators in a loop instead of IFS. It calls 'err' and 'die', and supports only comma, dash, and pipe as SEP chars (that's all I needed).
err() { echo "${0##*/}: Error:" "$#" >&2; }
die() { err "$#"; exit 1; }
# Return Nth field in a csv string, fields numbered starting with 1
csv_fldN() { fldN , "$1" "$2"; }
# Return Nth field in string of fields separated
# by SEP, fields numbered starting with 1
fldN() {
local me="fldN: "
local sep="$1"
local fldnum="$2"
local vals="$3"
case "$sep" in
-|,|\|) ;;
*) die "$me: arg1 sep: unsupported separator '$sep'" ;;
esac
case "$fldnum" in
[0-9]*) [ "$fldnum" -gt 0 ] || { err "$me: arg2 fldnum=$fldnum must be number greater or equal to 0."; return 1; } ;;
*) { err "$me: arg2 fldnum=$fldnum must be number"; return 1;} ;;
esac
[ -z "$vals" ] && err "$me: missing arg2 vals: list of '$sep' separated values" && return 1
fldnum=$(($fldnum - 1))
while [ $fldnum -gt 0 ] ; do
vals="${vals#*$sep}"
fldnum=$(($fldnum - 1))
done
echo ${vals%%$sep*}
}
Example:
$ CSVLINE="example,fields with whitespace,field3"
$ $ for fno in $(seq 3); do echo field$fno: $(csv_fldN $fno "$CSVLINE"); done
field1: example
field2: fields with whitespace
field3: field3
You can also use while loop
IFS=,
while read name val; do
echo "............................"
echo Name: "$name"
done<itemlst.csv

Resources