CSV file parsing in Bash

I have a CSV file with sample entries given below. I want to write a Bash script that reads the CSV file line by line and puts the first entry (e.g. 005) in one variable and the IP (e.g. 192.168.10.1) in another, so that I can pass both to another script.
005,192.168.10.1
006,192.168.10.109
007,192.168.10.12
008,192.168.10.121
009,192.168.10.123

A more efficient approach, without the need to fork cut each time:
#!/usr/bin/env bash
while IFS=, read -r field1 field2; do
# do something with $field1 and $field2
done < file.csv
The gains can be quite substantial for large files.
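To make the loop above concrete, here is a minimal sketch that passes both fields on to another command. The names handle_entry and process_csv are hypothetical stand-ins (handle_entry plays the role of "some other script"), and the sample data is copied from the question into a temporary file.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the "other script" that receives both values.
handle_entry() {
    printf 'id=%s ip=%s\n' "$1" "$2"
}

# Read a two-column CSV file ($1) and hand each row to handle_entry.
process_csv() {
    local id ip
    # IFS=, applies only to this read; the "|| [[ -n $id ]]" part also
    # processes a final line that lacks a trailing newline.
    while IFS=, read -r id ip || [[ -n $id ]]; do
        handle_entry "$id" "$ip"
    done < "$1"
}

# Sample data from the question, written to a temporary file.
csv_file=$(mktemp)
printf '%s\n' '005,192.168.10.1' '006,192.168.10.109' '007,192.168.10.12' > "$csv_file"
process_csv "$csv_file"
rm -f "$csv_file"
```

Note that this only suits simple files like the one shown: quoted fields containing commas would need a real CSV parser.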

Here's how I would do it with GNU tools:
while IFS= read -r line; do
printf '%s\n' "$line" | cut -d, -f1-2 --output-delimiter=' ' | xargs your_command
done < your_input.csv
while IFS= read -r line; do [...]; done < your_input.csv reads your file line by line.
For each line, cut keeps the first two fields (separated by commas, since it's a CSV) and prints them separated by a space; xargs then passes them as arguments to your_command. Note that --output-delimiter is a GNU extension to cut.

If this is a very simple CSV file with no quoted string literals etc., you can simply use read and cut:
#!/bin/bash
while IFS= read -r line
do
id_field=$(cut -d',' -f 1 <<<"$line") # here 005 for the first line
ip_field=$(cut -d',' -f 2 <<<"$line") # here 192.168.10.1 for the first line
# do something with $id_field and $ip_field
done < file.csv
The program works as follows: we use cut -d',' to obtain the first and second field of that line. We wrap this around a while read line and use I/O redirection to feed the file to the while loop.
Of course you substitute file.csv with the name of the file you want to process, and you can use other variable names than the ones in this sample.

Related

Why does "cut" command skip first line in this "while read line" loop?

I'm writing a bash script, and I need to take the second field of every line in a file, and save them in another file. I know there are many possible ways to do this, BUT, I tried first using while read line; do, and I got stuck. Now, I really want to know what is happening.
For example, input file would be:
line1 11111
line2 222222
line3 333
line4 4444
(The field separator is "\t").
This is what I was doing:
inputfile=$1
cat "$inputfile" | while read -r line
do
cut -f2 >> results_file
done
The problem is, the output would be:
222222
333
4444
(skipping the first line)
I've already tested hundreds of modifications and tried other commands instead of cut (like sed, grep...). I would appreciate some help, or someone pointing me in the right direction.
Thank you very much!
You are not using the variable $line set by read. Try instead
inputfile=$1
cat "$inputfile" | while read -r line
do
echo "$line" | cut -f2 >> results_file
done
In your original code, the while loop actually runs only once, not four times; try putting echo 'Hello!' inside the loop of your original code and you will see the message only once. Without the echo "$line" | part, cut -f2 reads directly from the loop's standard input and consumes everything that is left in the pipe.
That is, your while loop first consumes the first line of the stdin and puts this line in the variable $line, leaving the next three lines for later use. But $line is never used. Instead, the remaining three lines are consumed by the command cut.
All commands within a command group are within the scope of any redirections applied to a command group (or any compound command):
— https://mywiki.wooledge.org/BashGuide/CompoundCommands
The pipe operator creates a subshell environment for each command.
— https://mywiki.wooledge.org/BashGuide/InputAndOutput
We can interpret the quotes as "the stdin of your while loop (i.e., the output of cat "$inputfile") is also the stdin of cut, unless you give cut a different stdin, e.g., through another pipe echo "$line" | ...."
By the way, you can just use cut -f2 "$inputfile" >> results_file without the while loop.
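The stdin sharing described above can be observed directly by counting loop iterations. This is only an illustrative sketch; the two function names are made up for the demonstration and the sample input mirrors the tab-separated file from the question.

```bash
#!/usr/bin/env bash
# cut is given no input of its own, so it reads the rest of the loop's stdin.
count_iterations_shared_stdin() {
    local n=0 line
    while IFS= read -r line; do
        n=$((n + 1))
        cut -f2 > /dev/null
    done
    printf '%s\n' "$n"
}

# cut reads only "$line" through its own pipe, so the loop sees every line.
count_iterations_own_input() {
    local n=0 line
    while IFS= read -r line; do
        n=$((n + 1))
        printf '%s\n' "$line" | cut -f2 > /dev/null
    done
    printf '%s\n' "$n"
}

sample=$'line1\t11111\nline2\t222222\nline3\t333\nline4\t4444\n'
printf '%s' "$sample" | count_iterations_shared_stdin   # prints 1
printf '%s' "$sample" | count_iterations_own_input      # prints 4
```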
With respect to your comment Does it mean to use "\t at the end" as a separator - no. You're confusing what was suggested, $'\t' with '\t$'. $'\t' means "the literal tab character generated from the escape sequence \t".
You also said in your comment your real 2nd fields are URLs to be curled. You shouldn't be using a UUOC and cut anyway, here's how to really do this:
while IFS=$'\t' read -r key url; do
val=$(curl "$url" | whatever)
printf '%s\t%s\n' "$key" "$val"
done < "$inputfile" > results_file
Replace whatever with whatever command you use to produce the output you want from the curl output.
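For trying the loop above without network access, here is a small sketch in which a hypothetical transform function stands in for the curl "$url" | whatever step (it merely reports the length of the second field). It also shows that with IFS=$'\t' a space inside a field is not treated as a separator.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for: curl "$url" | whatever
transform() {
    printf '%s' "${#1}"
}

# Read tab-separated key/value lines from stdin and print key<TAB>result.
process_tsv() {
    local key value
    # $'\t' is a literal tab character, so only tabs separate the fields.
    while IFS=$'\t' read -r key value; do
        printf '%s\t%s\n' "$key" "$(transform "$value")"
    done
}

printf 'line1\thello world\nline2\tabc\n' | process_tsv
```

One caveat: tab counts as IFS whitespace, so consecutive tabs are collapsed and empty fields in the middle of a line are not preserved by this approach.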

How to process space at the head of line in file under shell?

I have a shell script that processes old_file into new_file. I want to rewrite one line in old_file like this:
old_file:
123
 abc
e45
new_file:
123
 abc bca
e45
So I use code like this:
cat old_file | while read line; do process "$line" > new_file; done
Here process is a function that matches and rewrites the line. But I found that the space at the start of the line in old_file goes missing, like this:
new_file:
123
abc bca
e45
So I changed the code to this:
echo "`cat old_file`" | while read line; do echo "$line" > new_file; done
But then the lines of old_file are treated as one single line (because of the "" quoting; sometimes this produces a "File name too long" error). I want to process the lines one by one, not all together.
So how can I process the lines in old_file one by one and, at the same time, keep the space at the start of the line?
Thank you~
Given your edit, there are several issues. First, you are using the wrong tool for the job. If you want to change the line ' abc' to ' abc bca', then sed is the correct tool. For example, given old_file,
$ cat old_file
123
 abc
e45
You can accomplish what you want with a simple
$ sed 's/^\s*abc$/& bca/' old_file
123
 abc bca
e45
To edit the file in place, use sed -i.bak, which rewrites old_file and saves the original as old_file.bak. (With GNU sed you can omit .bak to skip creating the backup.)
Next, your command cat old_file is an Unnecessary Use Of cat (a UUOC). Simply redirect the file to your loop, e.g.
while read -r line; do echo "$line"; done <old_file
Note: read strips leading (and trailing) whitespace by default, because it splits the line on the characters in IFS (the Internal Field Separator). Setting IFS= for the read command disables that and preserves leading whitespace, e.g.
while IFS= read -r line; do echo "$line"; done <old_file
Otherwise, check the options of your read implementation.
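Next, after process "$line" adds the ' bca', you can redirect the output of the whole loop to new_file once, instead of redirecting inside the loop (which truncates new_file on every iteration). The subshell parentheses below are optional; putting >new_file directly after done works as well, e.g.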
(while IFS= read -r line; do process "$line"; done <old_file) >new_file
Let me know if you have any further questions.
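The difference between read with the default IFS and with IFS= can be checked with a short sketch like the following. The function names are illustrative only; the input is the three-line old_file from the question, with its leading space on the second line.

```bash
#!/usr/bin/env bash
# Copies stdin to stdout using the default IFS: leading whitespace is stripped.
copy_default_ifs() {
    local line
    while read -r line; do printf '%s\n' "$line"; done
}

# Copies stdin to stdout with IFS emptied for read: lines are kept verbatim.
copy_empty_ifs() {
    local line
    while IFS= read -r line; do printf '%s\n' "$line"; done
}

printf '123\n abc\ne45\n' | copy_default_ifs   # second line comes out as "abc"
printf '123\n abc\ne45\n' | copy_empty_ifs     # second line comes out as " abc"
```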

Remove everything in a pipe delimited file after second-to-last pipe

How can I remove everything in a pipe-delimited file after the second-to-last pipe? For example, for the line
David|3456|ACCOUNT|MALFUNCTION|CANON|456
the result should be
David|3456|ACCOUNT|MALFUNCTION
Replace |(string without pipe)|(string without pipe) at the end of each line:
sed 's/|[^|]*|[^|]*$//' inputfile
Using awk, something like
awk -F'|' 'BEGIN{OFS="|"}{NF=NF-2; print}' inputfile
David|3456|ACCOUNT|MALFUNCTION
(or) use cut if you know the total number of columns, i.e. keep the first 4 of 6:
cut -d'|' -f -4 inputfile
David|3456|ACCOUNT|MALFUNCTION
The command I would use is
sed -r 's/(.*)\|.*\|.*/\1/' input.txt > output.txt
A pure Bash solution:
while IFS= read -r line || [[ -n $line ]] ; do
printf '%s\n' "${line%|*|*}"
done <inputfile
See Reading input files by line using read command in shell scripting skips last line (particularly the answer by Jahid) for details of how the while loop works.
See pattern matching in Bash for information about ${line%|*|*}.
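As a quick illustration of the suffix-removal expansion used above, the following sketch compares a few patterns on the sample line from the question. The function name strip_last_two is hypothetical.

```bash
#!/usr/bin/env bash
# Remove the last two pipe-delimited fields from a single line.
strip_last_two() {
    # % removes the shortest suffix matching the pattern |*|*
    printf '%s\n' "${1%|*|*}"
}

line='David|3456|ACCOUNT|MALFUNCTION|CANON|456'
printf '%s\n' "${line%|*}"     # David|3456|ACCOUNT|MALFUNCTION|CANON
strip_last_two "$line"         # David|3456|ACCOUNT|MALFUNCTION
printf '%s\n' "${line%%|*}"    # David  (%% removes the longest match)
```

A line with fewer than two pipes does not match the pattern and is printed unchanged.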

Extract first word in colon separated text file

How do I iterate through a file and print only the first word of each line? The lines are colon-separated, for example:
root:01:02:toor
The file contains several lines. This is what I've done so far, but it doesn't work.
FILE=$1
k=1
while read line; do
echo $1 | awk -F ':'
((k++))
done < $FILE
I'm not good with bash-scripting at all. So this is probably very trivial for one of you..
edit: variable k is to count the lines.
Use cut:
cut -d: -f1 filename
-d specifies the delimiter
-f specifies the field(s) to keep
If you need to count the lines, just
count=$( wc -l < filename )
-l tells wc to count lines
awk -F: '{print $1}' FILENAME
That will print the first word when separated by colon. Is this what you are looking for?
To use a loop, you can do something like this:
$ cat test.txt
root:hello:1
user:bye:2
test.sh
#!/bin/bash
while IFS= read -r line || [[ -n $line ]]; do
printf '%s\n' "$line" | awk -F: '{print $1}'
done < test.txt
Example of reading line by line in bash: Read a file line by line assigning the value to a variable
Result:
$ ./test.sh
root
user
A solution using perl
%> perl -F: -ane 'print "$F[0]\n";' [file(s)]
change the "\n" to " " if you don't want a new line printed.
You can get the first word without any external commands in bash like so:
printf '%s' "${line%%:*}"
This expands the variable named line with the longest suffix matching the glob :* removed, i.e. everything from the first colon onward (the %% makes the match greedy; a single % would remove only the shortest match, from the last colon onward).
Though with this solution you do need to do the loop yourself. If this is the only thing you want to do with the variable the cut solution is better so you don't have to do the file iteration yourself.
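Putting the expansion into a loop, and counting lines as the question's k variable intended, might look like the sketch below. The function name first_fields and the temporary sample file are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Print the first colon-separated field of every line in file $1,
# followed by the number of lines read.
first_fields() {
    local line count=0
    # "|| [[ -n $line ]]" also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "${line%%:*}"
        count=$((count + 1))
    done < "$1"
    printf 'lines: %d\n' "$count"
}

sample=$(mktemp)
printf 'root:01:02:toor\nuser:bye:2\n' > "$sample"
first_fields "$sample"
rm -f "$sample"
```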

sed - unterminated `s' command

I have this piece of code:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}') ;
name=$(awk '{print $2}') ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
file BP.csv has this format:
GO:0008283 cell proliferation
GO:0009405 pathogenesis
GO:0010201 response to continuous far red light stimulus by the high-irradiance response system
GO:0009641 shade avoidance
while GOEA.csv has this format:
4577 GO:0006807 0.994 2014_06_01
4577 GO:0016788 0.989 2014_06_01
4577 GO:0043169 0.977 2014_06_01
4577 GO:0043170 0.963 2014_06_01
sed doesn't work. I want to change GO:0043170 for example, to string "pi", but it gives:
sed: -e expression #1, char 12: unterminated `s' command
Why?
Thanks.
You are running your awk commands without giving them any input of their own, so they read from the loop's standard input instead of from $line. Try this:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}' <<< "$line") ;
name=$(awk '{print $2}' <<< "$line" ) ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
Let's clean up this code a bit:
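while read -r goterm name
do
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g"
done < BP.csv
The problem is that your awk statements read from STDIN just like your while loop does, so they all share the same input stream. The first awk consumes every remaining line of BP.csv, so $goterm ends up holding several lines; the embedded newline is what makes sed report an unterminated `s' command.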
What you want to do is to pull out the values from your line. I'm using read to do this. The read statement uses the values in $IFS to separate out the input. This is normally spaces, tabs, and newlines. The read reads each variable you put on the line, and the last value read in contains the entire rest of the line.
Thus:
while read line
reads in the entire line while:
while read goterm name
will break the line as
goterm="GO:0008283"
name="cell proliferation"
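The splitting behaviour just described, where the last variable receives the rest of the line, can be tried in isolation with a sketch like this one. The helper name split_line is hypothetical, and the input is taken from the BP.csv sample in the question.

```bash
#!/usr/bin/env bash
# Split one line into its first word and the remainder, using read.
split_line() {
    local goterm name
    # The default IFS splits on whitespace; "name" gets everything after
    # the first word, including its inner spaces.
    read -r goterm name <<< "$1"
    printf 'goterm=%s\nname=%s\n' "$goterm" "$name"
}

split_line 'GO:0008283 cell proliferation'
```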
One more thing. When you use grep and sed together, you probably can get away with just sed:
while read goterm name
do
sed -n "/$goterm/s/$goterm/pi/gp" GOEA.csv
done < BP.csv
The format for the sed command is:
/lines/command/parameters/
So, I'm searching for lines with $goterm in them, then replacing $goterm with pi. The -n means don't print lines as sed processes them, and p means print the lines where the substitution was made.
By the way, csv as a file suffix means comma-separated values, but neither file looks comma-separated. Are tabs separating the fields? If so, you'll need to set IFS to a tab.
I would restructure that whole thing more like this:
while read goterm restofline
do
grep -w "${goterm}" GOEA.csv | sed -e "s/${goterm}/pi/g"
done < BP.csv
No reason for the awk things, as the bash read builtin will do rudimentary field splitting for you if you give it multiple variables. Also, you aren't using name anyway, so it's not needed. cat is unnecessary as well.
Depending on your exact use case, even the grep may be unnecessary, making the inner command simply sed -ne "s/${goterm}/pi/gp" GOEA.csv. Unless your purpose for the grep -w is eliminating lines where ${goterm} is a substring of a word instead of the whole word...
For future reference, inserting a set -x above your loop in your script would show you the exact commands that are being run, so that you can compare them with your expectations.

Resources