code to identify values in comma separated value file - shell

i want to parse a csv file in a shell script. I want to input the name of the file at the prompt. like :
somescript.sh filename
Can it be done?
Also, I will read user input to display a particular a particular data number in the csv.
For example, say the csv file has 10 values in each line:
1,2,3,4,5,6,7,8,9,10
And I want to read the 5th value. how can I do it?
And multiple lines are involved.
Thanks.

If your file is really in such a simple format (just commas, no spaces), then cut -d, -f5 would do the trick.

#!/bin/sh
awk -F, "NR==$2{print \$$3}" "$1"
Usage:
./test.sh FILENAME LINE_NUMBER FIELD_NUMBER

Related

How to add a header to text file in bash?

I have a text file and want to convert it to csv file before to convert it, i want to add a header to text file so that the csv file has the same header. I have one thousand columns in text file and want to have one thousand column name. As a side note, the content of the text file is just rows of some numbers which is separated by comma ",". Is there any way to add the header line in bash?
I tried the way below and didn't work. I did the command below first in python.
> for i in range(1001):
> print "col" + "_" + "i"
save the output of this in text file with this command (python header.py >> header.txt) and add the output of this in format of text file to the original text file that i have like below:
cat header.txt filename.txt > newfilename.txt
then convert the txt file to csv file with "mv newfilename.txt newfilename.csv".
But unfortunately this way doesn't work as the header line has double number of other rows for some reason. I would appreciate any help to make this problem solve.
based on the description your file is already comma separated, so is a csv file. You just want to add a column number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", $i,(i==NF?ORS:FS)}1' file
will add column headers as many as the fields in the first row of the file
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..10}; do
echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header /tmp/data > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them into a new file together. It also edits the files such that certain information is written into every row for a all of a file's entries. For instance this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (�C)",20.04
"Low temperature limit (�C)",-0.02
"Date - Time","Temperature (�C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
Is converted into this format
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(�C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(�C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(�C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use a sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. Any lines that say "serial no." and "location" are important and require information. The other lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove anything
with less than 3 columns in your final file:
awk 'NF>=3' file

for loop in shell taking input from column name of a tab delimited file

I want to perform a iterative task on a bunch of similar files with similar partial files names contain "{$ID}". The program is running for each of the separate files manually. I wanted to write a for loop in shell to take 'ID' from a txt file and do the task. I am unable to give the input argument ID in the for loop from the values in coloumn1 of test.txt file
Eaxmple:
for ID in 'values in coloumn1 of test.txt file'
do
file1=xyz_{$ID}.txt
file2=abc_{$ID}.txt
execute the defined function using file1 and file2
write to > out_{$ID}.txt
done
Iterate over the file with a while loop, letting read both input a line from the file and splitting the first column off each line.
while read -r ID rest; do
...
done < test.txt
If the columns are delimited by a character other than whitespace, use IFS to specify that single character.
# comma-delimited files
while IFS=, read -r ID rest; do
...
done < test.txt

Edit files in Bash

I have a few files that contain IP addresses. I'm creating a script and have to figure out how to create a new user file with an IP address that is based off the file created before it. If the last file contains an IP of A.B.C.D the new file needs to be A.B.C.(D+4).
I think I need to use the 'sed' and 'awk' commands, but haven't been able to get anything working. How would I go about writing this part of the script?
Here's something to get you started: suppose there is a file called input looks like this:
Input: contents of input
127.0.0.1
127.0.0.2
127.0.0.3
127.0.0.200
You can do on the cmdline:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input > output
Explanation on what awk is doing here:
awk '...' - invoke awk, a tool used primarily for line-by-line manipulation of files, the stuff enclosed by single quotes are instructions to awk.
BEGIN{FS=OFS="."} - tell awk to use . as the delimiter for both input and output. FS stands for "Field Separator"
{$4=$4+4; print} - $4 means the 4th field. Since . is the delimiter, D corresponds to the 4th field and we add the integer value 4 to the 4th field. The print here is just short hand for printing the entire line.
input - name the input file as argument to awk; save a cat
> output - redirect the output to a file so you can inspect them for any issues before making the user files based on it.
Output: contents of output
127.0.0.5
127.0.0.6
127.0.0.7
127.0.0.204
And then you can read output one line at a time to create new user files as needed, maybe another script with something along the lines of:
while read line
do
echo "this is a user file" > "$line"
done < output
(and adjust it to your needs)
Finally, as long as you understand what's going on in the above, you can skip the output file altogether and just do this all in a one-liner:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input | while read line; do echo "hello world" > "$line"; done

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here is the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < $1
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough can get away with simple comma-splitting fine, shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offer robust CSV support instead of relying on using comma as a field separator. I know that scripting languages such as Python has such support. However, I am comfortable with the Tcl scripting language so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and store the line in the variable $line, then it split each line into a list of columns (variable $columns). Next, it picks out the specified column and assigned it to the $columnValue variable. If there is a match, print out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch:
awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' inputfile.csv
Jeez...with variables, and everything, awk is almost a real programming language...
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v index=$INDEX -v value=$VALUE '$index == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that support quoting and wouldn't require installing anything fancy on my VMware vMA appliance. Turns out this simple python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV)
#!/usr/bin/env python
import sys, csv
with sys.stdin as f:
reader = csv.reader(f)
for row in reader:
for col in row:
print col+'\t',
print
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"`
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
Awk (gawk) actually provides extensions, one of which being csv processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}'
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print $0 }, aka prints the full line as requested.

Resources