Create CSV from specific columns in another CSV using shell scripting - bash

I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
My shell scripting is rusty; can anyone point me in the right direction?
I have a bash script to read the source file but when I try to print the columns I want to a new file it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or, is there an easier - and faster - way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.

The snippet looks and works fine to me, maybe you have some weird characters in the file or it is coming from a DOS environment (use dos2unix to "clean" it!). Also, you can make use of read -r to prevent strange behaviours with backslashes.
But let's see how awk can solve this even faster:
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} this sets the input and output field separators to the comma. Alternatively, you can set the input separator with -F",", -F, or pass it as a variable with -v FS=",". OFS can be passed the same way, with -v OFS=",".
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that using print, every comma-separated parameter that you give will be printed with the OFS that was previously set. Here, with a comma.
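For comparison, here is a rough Python sketch of the same column selection using the standard csv module, which also copes with quoted fields that contain commas. It is only an illustration: the function name select_columns and the in-memory sample are assumptions made for this example, not part of the original answer.

```python
import csv
import io


def select_columns(src, dst, indexes):
    """Copy the given zero-based column indexes from src to dst, in order."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        # Pad short rows so a missing trailing column becomes an empty field.
        row = row + [""] * (max(indexes) + 1 - len(row))
        writer.writerow([row[i] for i in indexes])


# Sample lines taken from the question; columns 6, 6, 1 are indexes 5, 5, 0.
sample = io.StringIO(
    "Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name\n"
    "AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust\n"
)
out = io.StringIO()
select_columns(sample, out, [5, 5, 0])
print(out.getvalue(), end="")
```

With real files you would pass two file objects opened with newline="" instead of the StringIO buffers. For the 12-column file from the edit, the index list would presumably be [11, 11, 0].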

Related

How to display the latest line based on the file's name or the line's position in bash

I have a tricky question about how to keep only the latest log data, because my server reposted the same data twice.
This is the result after I grep my folder (I have tons of data; this is trimmed to keep it simple):
...
20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,
20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150630-201427.csv:20150630,CFIIASU_CN,68,102.19569,0.10692
20150630-201427.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150630-201427.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150630-201427.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
...
The data actually comes from many CSV files; I picked only two of them to make the example. Some explanations:
The example comes from two files, 20150630-201427.csv and 20150701-151654.csv, and it has 5 columns, which correspond to date, dataname, data_column1, data_column2, data_column3.
These lines have the same data date 20150630 and the same dataname (CFIIASU, CFIIASU_AU, etc.), but some of the numbers in the fourth and fifth columns (data_column2 and data_column3) are different.
How could I keep only the data from 20150701-151654.csv, based on the file's name and the data date, and apply that to my whole data set?
To make it clearer: I'd like to keep the lines from "the latest csv", and the latest csv is identified by the file's name, which in this example is 2015070. But in my whole data set there are so many 20xxxxxx.csv files that I can't check them one by one.
For the example above, the result should end up like this:
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
Thanks in advance.
Your question isn't clear but it sounds like this might be what you're trying to do (print all lines from the last csv mentioned in the input file):
$ tac file | awk -F':' 'NR>1 && $1!=prev{exit} {print; prev=$1}' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
or maybe this (print the last line seen for every 20150630,CFIIASU etc. pair in the input file):
$ tac file | awk -F'[:,]' '!seen[$2,$3]++' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
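If the grep output is not guaranteed to be ordered by file name, a variation on the second idea is to compare the file names directly. The Python sketch below does that. It is not a translation of the awk command: it assumes every line has the form filename:date,dataname,... and that the file names sort chronologically as plain strings (true for the YYYYMMDD-HHMMSS pattern shown). The name keep_latest is made up for this example.

```python
def keep_latest(lines):
    """Keep, for each (date, dataname) pair, the line from the newest file."""
    latest = {}
    for line in lines:
        filename, _, record = line.partition(":")
        date, name = record.split(",")[:2]
        key = (date, name)
        # File names like 20150701-151654.csv compare correctly as strings.
        if key not in latest or filename >= latest[key][0]:
            latest[key] = (filename, line)
    return [line for _, line in latest.values()]


lines = [
    "20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,",
    "20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569",
    "20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294",
    "20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569",
]
result = keep_latest(lines)
for line in result:
    print(line)
```

The output keeps one line per pair, in the order in which each pair first appeared in the input.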

Add missing columns to CSV file?

Starting Question
I have a CSV file which is formed this way (variable.csv)
E,F,G,H,I,J
a1,
,b2,b3
c1,,,c4,c5,c6
As you can see, the first and second data rows do not have all the commas needed. Here's what I want:
E,F,G,H,I,J
a1,,,,,
,b2,b3,,,
c1,,,c4,c5,c6
With this, now every row has the right number of columns. In other words, I'm looking for a unix command which smartly appends the correct number of commas to the end of each row to make the row have the number of columns that we expect, based off the header.
Here's what I tried, based off of some searching:
awk -F, -v OFS=, 'NF=6' variable.csv. This works in the above case, BUT...
Final Question
...Can we have this command work if the column data contains commas itself, or even new line characters? e.g.
E,F,G,H,I,J
"a1\n",
,b2,"b3,3"
c1,,,c4,c5,c6
to
E,F,G,H,I,J
"a1\n",,,,,
,b2,"b3,3",,,
c1,,,c4,c5,c6
(Apologies if this example's formatting is malformed due to the way the newline is represented.)
Short answer:
python3 -c 'import fileinput,sys,csv;b=list(csv.reader(fileinput.input()));w=max(len(i)for i in b);csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b)' variable.csv
The python script may be long, but this is to ensure that all cases are handled. To break it down:
import fileinput,sys,csv
b=list(csv.reader(fileinput.input())) # read every row into a list
w=max(len(i)for i in b) # how many fields in the widest row?
csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b) # pad each row and write it, re-quoting fields that need it
BTW, in your starting problem
awk -F, -v OFS=, 'NF<6{$6=""}1' variable.csv
should work. (I think the difference is implementation- or version-related: your code works on GNU awk but not on the macOS version.)

Populate a value in a particular column in csv

I have a folder where there are 50 excel sheets in CSV format. I have to populate a particular value say "XYZ" in the column I of all the sheets in that folder.
I am new to Unix and have looked at a couple of pages (here and here). Can anyone please provide a sample script to begin with?
For example :
Let's say column C in this case:
A B C
ASFD 2535
BDFG 64486
DFGC 336846
I want to update column C to value "XYZ".
Thanks.
I would export those files into CSV format
- with a semicolon as the field separator
- possibly leaving out the column headers (otherwise see the comment below)
Then the following combination of shell and sed script could more or less do the trick:
#!/bin/sh
for i in *.csv
do
    # append ";XYZ" to the end of every line of the file
    sed -i -e 's/$/;XYZ/' "$i"
done
-i means to edit the file in place; the substitution appends the value to every line
-e specifies the sed command to run, here a substitution
If the CSV files also contain column headers, you might want a similar script that renames "XYZ" to "C" in the 1st line only.
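If the value has to land in one particular column rather than at the end of each line, a short Python sketch along these lines may be easier to adapt. It assumes semicolon-separated files with a header row. The function name fill_column, its parameters and the sample rows (rebuilt from the question's example) are invented for illustration.

```python
import csv
import io


def fill_column(src, dst, index, value, delimiter=";", has_header=True):
    """Write src to dst with column `index` (zero-based) set to `value`."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
    for row_no, row in enumerate(reader):
        if has_header and row_no == 0:
            writer.writerow(row)  # leave the header line untouched
            continue
        # Pad short rows so the target column exists before assigning to it.
        row = row + [""] * (index + 1 - len(row))
        row[index] = value
        writer.writerow(row)


sample = io.StringIO("A;B;C\nASFD;2535\nBDFG;64486\nDFGC;336846\n")
out = io.StringIO()
fill_column(sample, out, 2, "XYZ")  # column C is index 2; column I would be 8
print(out.getvalue(), end="")
```

To process the whole folder you could loop over glob.glob("*.csv"), write each result to a temporary file and then replace the original. That part is left out here.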

Finding a newline in the csv file

I know there are a lot of questions about this (latest one here.), but almost all of them are how to join those broken lines into one from a csv file or remove them. I don't want to remove, but I just want to display/find that line (or probably the line number?)
Example data:
22224,across,some,text,0,,,4 etc
33448,more,text,1,,3,,,4 etc
abcde,text,number,444444,0,1,,,, etc
358890,more
,text,here,44,,,, etc
abcdefg,textds3,numberss,413,0,,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
After more searching, I know I shouldn't use bash to accomplish this but should rather use Perl. I tried (with snippets from various websites, as I don't know Perl), but apparently I don't have the Text::CSV package and I don't have permission to install it.
As I said, I have no idea how to even start looking for this, so I don't have any script. This is not a Windows file; it is very much a Unix file, so we can ignore the CR problem.
Desired output:
358890,more
,text,here,44,,,, etc
985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
or
Line 4: 358890,more
,text,here,44,,,, etc
Line 7: 985678,93838,text,,,,
,text,continuing,from,previous,line,,, etc
Much appreciated.
You can use perl to count the number of commas on each line and append the next line until the count reaches the expected number (28 in this command; adjust it to your data):
perl -ne 'if(tr/,/,/<28){$line=$.;while(tr/,/,/<28 && !eof){$_.=<>}print "Line $line: $_\n"}' file
I do love Perl but I don't think it is the best tool for this job.
If you want a report of all lines that DO NOT have exactly the correct number of commas/delimiters, you could use the unix language awk.
For example, this command:
/usr/bin/awk -F , 'NF != 8' < csv_file.txt
will print all lines that DO NOT have exactly 7 commas. The comma is specified as the field separator with -F, and NF holds the number of fields on each line (7 commas means 8 fields).
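The same check can be sketched in Python if you also want the line numbers in the report. Like the awk command, this simply counts commas, so it assumes no field contains a quoted comma. The function name find_odd_lines and the tiny sample are hypothetical.

```python
def find_odd_lines(lines, expected_fields):
    """Yield (line_number, text) for lines whose field count is unexpected."""
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if len(text.split(",")) != expected_fields:
            yield number, text


# Made-up sample: the record on lines 2 and 3 was broken by a stray newline.
sample = ["a,b,c\n", "d,e\n", ",f\n", "g,h,i\n"]
for number, text in find_odd_lines(sample, 3):
    print(f"Line {number}: {text}")
```

For a real file you could pass the open file object directly, e.g. find_odd_lines(open("csv_file.txt"), 8).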

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column on original.csv vs final.csv respectively;
original.csv    final.csv
2212026837      2212026837
2256 41688 6    2256416886
2076113566      2076113566
2009 84517 7    2009845177
2067950476      2067950476
2057 90531 5    2057 90531 5
2085271676      2085271676
2095183426      2095183426
2347366235      2347366235
2200160434      2200160434
2229359595      2229359595
2045373466      2045373466
2053849895      2053849895
2300 81552 3    2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
If that solves your problem, great! You got lucky, move on. :)
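To see why a plain space pattern can miss some rows, here is a tiny Python illustration. The field value is invented: it uses tabs where the question's sample shows gaps, which is only a guess at what the real data might contain. Printing the repr of a suspect field is one way to find out which character is actually there.

```python
import re

# Hypothetical field: looks like "2057 90531 5" but is separated by tabs.
field = "2057\t90531\t5"

only_spaces = field.replace(" ", "")         # what matching / / does
all_whitespace = re.sub(r"\s+", "", field)   # what a whitespace class does

print(repr(field))           # shows the hidden characters as \t
print(repr(only_spaces))     # unchanged, the tabs are still there
print(repr(all_whitespace))  # digits only
```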
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
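import csv, fileinput as fi, re, sys

# write through csv.writer so fields containing the delimiter stay quoted
writer = csv.writer(sys.stdout, delimiter='|', lineterminator='\n')
for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    writer.writerow(row)
Save the above in a file like colfixer.py and run it with python3 colfixer.py original.csv >final.csv.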
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
n = split($26, a, /[[:space:]]+/)
$26=a[1]
for(i=2; i<=n; i++)
$26=$26""a[i]
}1' original.csv > final.csv

Resources