Parsing a key value in a CSV file using shell script - shell

Given a CSV input file:
Id Name Address Phone
---------------------
100 Abc NewYork 1234567890
101 Def San Antonio 9876543210
102 ghi Chicago 7412589630
103 GHJ Los Angeles 7896541259
How do we grep (or use another command) for the value using the key?
If the key is 100, the expected output is NewYork

You can try this:
grep 100 filename.csv | cut -d, -f3
Output:
NewYork
This searches the whole file for the string 100 and prints the third comma-separated field of every matching line.
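Be aware that a bare grep 100 also matches 100 anywhere on the line, for example inside a phone number. If the file really is comma-separated, a safer sketch anchors the key to the start of the line:
grep '^100,' filename.csv | cut -d, -f3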

With GNU grep:
grep -Po '^100.....\K...........' file
or shorter:
grep -Po '^100.{5}\K.{11}' file
Output:
NewYork
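Here \K is a PCRE feature that discards everything matched so far, so only the 11 characters of the city column are printed. The key can come from a shell variable, e.g. (a sketch using the sample data):
key=101
grep -Po "^${key}.{5}\K.{11}" file
which would print San Antonio.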

Awk splits lines by whitespace sequences (by default).
You could use that to write a condition on the first column.
Your example input looks not like CSV but like fixed-width columns (except the header). If that's the case, then you can extract the name of the city as a substring:
awk '$1 == 100 { print substr($0, 9, 11); }' input.csv
Here 9 is the starting position of the city column, and 11 is its length.
If on the other hand your input file is not what you pasted, but really CSV (comma separated values), and if there are no other embedded commas or newline characters in the input, then you can write like this:
awk -F, '$1 == 100 { print $3 }' input.csv
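In the fixed-width case the extracted substring may carry trailing padding (NewYork is only 7 characters in an 11-character column). A small variant that trims it, assuming the same column positions:
awk '$1 == 100 { s = substr($0, 9, 11); sub(/ +$/, "", s); print s }' input.csv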

Related

How to convert the first letter of all words of several columns of a csv file to uppercase while making the rest of the letters lowercase?

Bash 4.4.0
Ubuntu 16.04
I have several columns in a CSV file that are all capital letters and some are lowercase. Some columns have only one word while others may have 50 words. At this time, I convert column by column with 2 commands and it is quite taxing on the server when the file has 50k lines.
Example:
#-- Place the header line in a temp file
head -n 1 "$tmp_input1" > "$tmp_input3"
#-- Remove the header line from the original file
tail -n +2 "$tmp_input1" > "$tmp_input1-temp" && mv "$tmp_input1-temp" "$tmp_input1"
#-- Change the words in the 11th column to lower case then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$11 = tolower($11); print}' "$tmp_input4" > "$tmp_input5"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input5"
#-- Change the words in the 12th column to lower case then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$12 = tolower($12); print}' "$tmp_input5" > "$tmp_input6"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input6"
#-- Change the words in the 13th column to lower case then change the first letter to upper case
awk -F"," 'BEGIN{OFS=","} {$13 = tolower($13); print}' "$tmp_input6" > "$tmp_input7"
sed -i "s/\b\(.\)/\u\1/g" "$tmp_input7"
cat "$tmp_input7" >> "$tmp_input3"
Is it possible to do multiple columns in a single command?
Here is an example of the csv file:
"dealer_id","vin","conditon","stocknumber","make","model","year","broken","trim","bodystyle","color","interiorcolor","interiorfabric","engine","enginedisplacement","engineaspiration","engineText","transmission","drivetrain","mpgcity","mpghighway","mileage","cylinders","fuelconditon","optiontext","description","titlestatus","warranty","price","specialprice","window_sticker_price","mirrorhangerprice","images","ModelCode","PackageCodes"
"JOHNVANC04A","2C4RC1N73JR290946","N","JR290946","Chrysler","Pacifica","2018","","Hybrid Limited FWD","Mini-van, Passenger","Brilliant BLACK Crystal PEARL Coat","","..LEATHER SEATS..","V6 Cylinder Engine","3.6L","","","AUTOMATIC","FWD","0","0","553","6","H","..1-SPEED A/T..,..AUTO-OFF HEADLIGHTS..,..BACK-UP CAMERA..,..COOLED DRIVER SEAT..,..CRUISE CONTROL..","======KEY FEATURES INCLUDE: . LEATHER SEATS. THIRD ROW SEAT. QUAD BUCKET SEATS. REAR AIR. HEATED DRIVER SEAT.","","0","41680","","48830","","http://i.autoupktech.com/c640/9c40231cbcfa4ef89425d108e4e3a410.jpg",http://i.autoupnktech.com/c640/9c40231cbcfa4ef89425d108e4e3a410.jpg","RUES53","AAX,AT2,DFQ,EH3,GWM,WPU"
Here's a snippet of the above columns refined
Column 11 should be - "Brilliant Black Crystal Pearl Coat"
Column 13 should be - "Leather Seats"
Column 16 should be - "Automatic"
Column 23 should be - "1-Speed A/T,Auto-Off Headlights,Back-up Camera"
Column 24 should be - "Key Features Include: Leather Seats,Third Row Seat"
Keep in mind, the double-quotes surrounding the columns can't be stripped. I only need to convert certain columns and not the entire file. Here's an example of the columns 11, 13, 16, 23 and 24 converted.
"Brilliant Black Crystal Pearl Coat","Leather Seats","Automatic","1-Speed A/T,Auto-Off Headlights,Back-up Camera","Key Features Include: Leather Seats,Third Row Seat"
Just to add another option, here is a one-liner using just sed:
sed -i -e 's/.*/\L&/' -e 's/[a-z]*/\u&/g' filename
And here is a proof of concept:
$ cat testfile
jUSt,a,LONG,list of SOME,RAnDoM WoRDs
ANother LIne
OneMore,LiNe
$ sed -e 's/.*/\L&/' -e 's/[a-z]*/\u&/g' testfile
Just,A,Long,List Of Some,Random Words
Another Line
Onemore,Line
$
If you want to convert just the headers of the CSV file (first line), just replace s with 1s on both search patterns.
You can find an excellent article explaining the magic here: sed – Convert to Title Case.
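For example, to title-case only the header line of the question's file:
sed -e '1s/.*/\L&/' -e '1s/[a-z]*/\u&/g' file
The 1 address restricts both substitutions to the first line.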
Here is another alternative (off-topic here, I know) in Python 3:
import csv
from pathlib import Path
infile = Path('infile.csv')
outfile = Path('outfile.csv')
titled_cols = [10, 12, 15, 22, 23]
titled_data = []
with infile.open() as fin, outfile.open('w', newline='') as fout:
    for row in csv.reader(fin, quoting=csv.QUOTE_ALL):
        for i, col in enumerate(row):
            if i in titled_cols:
                row[i] = col.title()  # assign back so the row is actually modified
        titled_data.append(row)
    csv.writer(fout, quoting=csv.QUOTE_ALL).writerows(titled_data)
Just define the columns you want title-cased in titled_cols (column indexes are zero-based) and it will do what you want.
I guess infile and outfile are self-explanatory and outfile will contain the modified version of your original file.
I hope it helps.
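One caveat worth noting: str.title() uppercases the letter after any non-letter character, which is usually what you want here:
>>> "brilliant BLACK crystal PEARL coat".title()
'Brilliant Black Crystal Pearl Coat'
but it also turns, say, don't into Don'T, so check fields containing apostrophes. Unlike the awk and sed answers, the csv module also handles commas embedded inside quoted fields correctly.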
You could create a user-defined function and apply it to the columns you need to modify.
awk -F, 'function toproper(s) { return toupper(substr(s, 1, 1)) tolower(substr(s, 2, length(s))) } {printf("%s,%s,%s,%s\n", toproper($1), toproper($2), toproper($3), toproper($4));}'
Input:
FOO,BAR,BAZ,ETC
Output:
Foo,Bar,Baz,Etc
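To touch only the columns the question asks about, you could apply the function to selected fields, e.g. (a sketch assuming plain comma-separated fields with no embedded commas):
awk -F, -v OFS=, '
function toproper(s) { return toupper(substr(s, 1, 1)) tolower(substr(s, 2)) }
{ $11 = toproper($11); $13 = toproper($13); print }' file
Note that toproper capitalizes only the first letter of the whole field; title-casing every word inside a field would additionally require splitting the field on spaces.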
Assuming the fields of the csv file are not quoted by double quotes,
meaning that we can simply split a record on commas and whitespace, how
about a Perl solution:
perl -pe 's/(^|(?<=[,\s]))([^,\s])([^,\s]*)((?=[,\s])|$)/\U$2\L$3/g' input.csv
input.csv:
Bash,4.4.0,Ubuntu,16.04
I have several columns in a CSV file,that, are, all capital letters
and some are lowercase.
Some columns have only,one,word,while others may have 50 words.
output:
Bash,4.4.0,Ubuntu,16.04
I Have Several Columns In A Csv File,That, Are, All Capital Letters
And Some Are Lowercase.
Some Columns Have Only,One,Word,While Others May Have 50 Words.
This version uses AWK to do the job:
This is the command (change file to your filename)
awk -F"," 'BEGIN{OFS=","}{ for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""tolower(substr($i,2,length($i)))}print $0}' file | awk -F" " 'BEGIN{OFS=" "} { for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""substr($i,2,length($i))}print $0}'
The test:
cat file
pepe is cool,ASDASD ASDAS,and no podpoiaops
awk -F"," 'BEGIN{OFS=","}{ for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""tolower(substr($i,2,length($i)))}print $0}' file | awk -F" " 'BEGIN{OFS=" "} { for (i=1; i<=NF; i++) { $i=toupper(substr($i,1,1))""substr($i,2,length($i))}print $0}'
Pepe Is Cool,Asdasd Asdas,And No Podpoiaops
Explanation
BEGIN{OFS=","} tells awk to use a comma as the output field separator.
The for statement uses NF, the built-in variable holding the number of
fields in each line.
The substr calls split each field, uppercase its first letter, and assign the result back to the field.
The whole row is printed with print $0.
Finally, the second awk splits the lines produced by the first one, this time using spaces as the separator. This way it sees every individual word in the file and uppercases its first character.
Hope it helps
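A single-pass sketch of the same idea, looping over the words inside each comma-separated field instead of running awk twice:
awk -F, -v OFS=, '{
    for (i = 1; i <= NF; i++) {
        n = split(tolower($i), w, " ")
        $i = ""
        for (j = 1; j <= n; j++)
            $i = $i (j > 1 ? " " : "") toupper(substr(w[j], 1, 1)) substr(w[j], 2)
    }
    print
}' file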

print 1st string of a line if last 5 strings match input

I have a requirement to print the first string of a line if last 5 strings match specific input.
Example: Specified input is 2
India;1;2;3;4;5;6
Japan;1;2;2;2;2;2
China;2;2;2;2
England;2;2;2;2;2
Expected Output:
Japan
England
As you can see, China is excluded as it doesn't meet the requirement (all of the last 5 fields have to match the input).
grep ';2;2;2;2;2$' file | cut -d';' -f1
$ in a regex stands for "end of line", so grep will print all the lines that end in the given string
-d';' tells cut to delimit columns by semicolons
-f1 outputs the first column
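If the value to match comes from user input, the same pattern can be built from a variable, e.g. (a bash sketch with the specified input in n):
n=2
grep ";$n;$n;$n;$n;$n\$" file | cut -d';' -f1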
You could use awk:
awk -F';' -v v="2" -v count=5 '
{
    c = 0
    for (i = 2; i <= NF; i++) {
        if ($i == v) c++
        if (c >= count) { print $1; next }
    }
}' file
where
v is the value to match
count is the number of matching values required before the wanted string is printed
the for loop walks over all the ;-delimited fields looking for matches
Note that this script does not require the five 2 values to be consecutive.
With sed:
sed -n 's/^\([^;]*\).*;2;2;2;2;2$/\1/p' file
It captures the leading run of non-; characters on lines ending with ;2;2;2;2;2 and prints it.
It can be shortened with GNU sed to:
sed -nE 's/^([^;]*).*(;2){5}$/\1/p' file
awk -F\; '/;2;2;2;2;2$/{print $1}' file
Japan
England
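The pattern can also be built at run time instead of hard-coding it, e.g. (a sketch parameterizing the input value):
awk -F';' -v n=2 'BEGIN { for (i = 0; i < 5; i++) pat = pat ";" n; pat = pat "$" } $0 ~ pat { print $1 }' file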

Extracting one column of a text file to another when a pattern is matched

I have a tab-separated text file that has 4 columns of data:
StudentId Student Name GPA Major
I have to write a shell command that stores the names of the students who are CS majors in another file. I used grep cs students.txt, which displays just the CS students, but I do not know how to then take just the students' names and save them to a file.
Assuming that your input file is tab-separated (so you can have spaces in names):
awk -F'\t' '$4 == "cs" { print $2 }' <infile >outfile
This matches column 4 (major) against "cs", and prints column 2 when it is an exact match.
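A concrete invocation (students.txt and csnames.txt are hypothetical names):
awk -F'\t' '$4 == "cs" { print $2 }' students.txt > csnames.txt
If the case of the major column can vary (cs, CS, Cs), compare tolower($4) == "cs" instead.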
Got it:
grep cs students.txt | cut -f2 >file1
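Note that a bare grep cs can also match cs inside a name. Since the major is the last tab-separated field, anchoring it is safer; in bash this could be sketched as:
grep $'\tcs$' students.txt | cut -f2 >file1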

add row number to the last column using awk or bash

Input file format:
name id department
xyz 20 cic
abc 25 cis
Output should look like:
name id department
xyz 20 cic 1
abc 25 cis 2
Note: all the fields are tab separated.
Appreciate any help!!
$ awk -F'\t' 'NR>1{$0=$0"\t"NR-1} 1' file
name id department
xyz 20 cic 1
abc 25 cis 2
You should try this:
awk '{printf "%s\t%s\n",$0,NR}' File_name
Explanation:
$0 is the whole input line
NR is the current line (record) number
%s prints a string
\t is a horizontal tab
\n is a newline
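As written, this appends a number to every line, including the header. A variant that reproduces the expected output exactly (skipping the header and numbering from 1) could look like:
awk -F'\t' -v OFS='\t' 'NR == 1 { print; next } { print $0, NR-1 }' file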
A variation on Ed Morton's answer:
awk -F'\t' -v OFS='\t' 'NR>1 { $(NF+1)=NR-1} 1' file
This sets the output field separator using the -v option, then simply adds a new field to the current record by assigning to $(NF+1).

extracting whole line in awk if condition using cut -d

I want to get the value of the 11th column in my tab delimited file.
The value there is multiple values concatenated using : as a separator.
Example result (the 11th column of one line):
.:7:.:2:100:.
I now want to split this value on the : separator and retrieve the second part.
This can be done with cut -d':' -f2
my question:
How can I make a statement which returns all lines in my file which have value 5 or more in the second part of the 11th column?
input file (2 lines):
chr1 4396745 bnd_549 a a[chr9:136249370[ 100 PASS SVTYPE=BND;MATEID=bnd_550;EVENT=transl_inter_1022;GENE=; GT:AD:DP:SS:SSC:BQ .:.:.:.:.:. .:7:.:2:100:.
chr1 6315381 bnd_551 c ]chr9:68720182]c 100 PASS SVTYPE=BND;MATEID=bnd_552;EVENT=transl_inter_9346;GENE=; GT:AD:DP:SS:SSC:BQ .:.:.:.:.:. .:3:.:2:100:.
expected output:
chr1 4396745 bnd_549 a a[chr9:136249370[ 100 PASS SVTYPE=BND;MATEID=bnd_550;EVENT=transl_inter_1022;GENE=; GT:AD:DP:SS:SSC:BQ .:.:.:.:.:. .:7:.:2:100:.
Output with awk -F: '$11>=5' example.sorted.vcf:
no output
This should work (though untested):
awk '{split($11,ary,/:/); if(ary[2]>=5) print}' myFile
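The original attempt produced no output because -F: splits the entire line on colons, so $11 is no longer the 11th tab-separated column. split() keeps the default whitespace field splitting and only breaks the 11th field apart; on the sample input this prints the first line (7 >= 5) and skips the second (3 < 5).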
You could also use whitespace or colon as the field separator:
awk -F ':|[[:blank:]]+' '$23 > 5' filename
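A caveat for this variant: field 23 depends on the exact number of colon- and blank-separated tokens before it, so it only works if every line has the same layout. A parameterized sketch of the split() approach may be more robust; the +0 forces a numeric comparison, since the sub-field can also be the placeholder . :
awk -v min=5 '{ split($11, a, ":"); if (a[2] + 0 >= min) print }' example.sorted.vcf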
