Add missing columns to CSV file? - bash

Starting Question
I have a CSV file which is formed this way (variable.csv)
E,F,G,H,I,J
a1,
,b2,b3
c1,,,c4,c5,c6
As you can see, the first and second data rows do not have all the commas needed. Here's what I want:
E,F,G,H,I,J
a1,,,,,
,b2,b3,,,
c1,,,c4,c5,c6
With this, now every row has the right number of columns. In other words, I'm looking for a unix command which smartly appends the correct number of commas to the end of each row to make the row have the number of columns that we expect, based off the header.
Here's what I tried, based off of some searching:
awk -F, -v OFS=, 'NF=6' variable.csv. This works in the above case, BUT...
Final Question
...Can we have this command work if the column data contains commas itself, or even new line characters? e.g.
E,F,G,H,I,J
"a1\n",
,b2,"b3,3"
c1,,,c4,c5,c6
to
E,F,G,H,I,J
"a1\n",,,,,
,b2,"b3,3",,,
c1,,,c4,c5,c6
(Apologies if this example's formatting is malformed due to the way the newline is represented.)

Short answer:
python3 -c 'import fileinput,sys,csv;b=list(csv.reader(fileinput.input()));w=max(len(i)for i in b);csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b)' variable.csv
The Python one-liner may be long, but that is to ensure that all cases (embedded commas, quotes, newlines) are handled. To break it down:
import fileinput,sys,csv
b=list(csv.reader(fileinput.input()))  # parse the CSV, respecting quoted commas and newlines
w=max(len(i)for i in b)                # how many fields should every row have?
csv.writer(sys.stdout,lineterminator="\n").writerows(i+[""]*(w-len(i))for i in b)  # pad each row and re-quote on output
BTW, in your starting problem
awk -F, -v OFS=, 'NF<6{$6=""}1' variable.csv
should work. (I think the difference is implementation-related: your NF=6 version works with GNU awk but not with the BSD awk that ships with macOS.)
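If you want an awk version that behaves the same across implementations, a simple padding loop should do (a sketch, assuming a fixed width of 6 and no quoted commas, as in your starting problem):
awk -F, -v OFS=, '{ for (i = NF + 1; i <= 6; i++) $i = "" } 1' variable.csv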

Related

remove line in csv file if string found (from another text file) in bash

Due to a power failure, I am having to clean up jobs which are run based on text files. The problem is, I have a text file with strings like so (they are uuids):
out_file.txt (~300k entries)
<some_uuidX>
<some_uuidY>
<some_uuidZ>
...
and a csv like so:
in_file.csv (~500k entries)
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location3/,<some_uuidX>.json.<some_string3>
/path/to/some/location4/,<some_uuidY>.json.<some_string4>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
/path/to/some/location6/,<some_uuidZ>.json.<some_string6>
...
I would like to remove the lines from in_file whose uuid matches an entry in out_file.
The end result:
/path/to/some/location1/,<some_uuidK>.json.<some_string1>
/path/to/some/location2/,<some_uuidJ>.json.<some_string2>
/path/to/some/location5/,<some_uuidN>.json.<some_string5>
...
Since the file sizes are fairly large, I was wondering if there is an efficient way to do it in bash.
Any tips would be great.
Here is a potential grep solution:
grep -vFwf out_file.txt in_file.csv
And a potential awk solution (likely faster):
awk -F"[,.]" 'FNR==NR { a[$1]; next } !($2 in a)' out_file.txt in_file.csv
NB there are caveats to each of these approaches. Although they both appear to be suitable for your intended purpose (as indicated by your comment "the numbers add up correctly"), posting a minimal, reproducible example in future questions is the best way to help us help you.
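As a quick sanity check that "the numbers add up", compare line counts (filtered.csv is just an illustrative name, and this assumes no duplicate uuids and that every uuid in out_file.txt matches exactly one line):
grep -vFwf out_file.txt in_file.csv > filtered.csv
wc -l in_file.csv out_file.txt filtered.csv   # expect: in_file - out_file = filtered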

How to use file chunks based on characters instead of lines for grep?

I am trying to parse log files of the form below:
---
metadata1=2
data1=2,data3=5
END
---
metadata2=1
metadata1=4
data9=2,data3=2, data0=4
END
Each section between the --- and END is an entry. I want to select the entire entry that contains a field such as data1. I was able to solve it with the following command, but it is painfully slow.
pcregrep -M '(?s)[\-].*data1.*END' temp.txt
What am I doing wrong here?
Parsing this file with pcregrep is going to be challenging. pcregrep does not have the ability to break the file into logical records, so the pattern you specified will try to find matches by combining multiple records together, sometimes including unmatched records in the output.
For example, if the input is "--- data=a END --- data1=a END", the above command will select both records, because it forms a single match between the initial '---' and the trailing 'END'.
For this kind of input, consider using AWK. It can read input with a custom record separator (RS), which makes it easy to split the input into records and apply the pattern to each of them. If you prefer, you can use Perl or Python instead.
Using awk's RS to create "records", it is possible to apply the pattern test to every record:
awk -v RS='END\n' '/data1/ { print $0 }' < log1
awk -v RS='END\n' '/data1/ { print NR, $0 }' < log1
The second command also includes the record number in the output, in case that is useful.
While AWK may not be as fast as pcregrep for plain pattern matching, in this case it will have no trouble processing a large input set.
I would use awk:
awk 'BEGIN{RS=ORS="END\n"}/\ydata1/' file
Explanation:
awk works based on input records. By default such a record is a line of input, but this behaviour can be changed by setting the record separator (and output record separator for the output).
By setting them to END\n, we can search whole records of your input.
The regular expression /\ydata1/ searches those records for the term data1; the \y matches a word boundary, which prevents it from matching metadata1.
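Note that both \y and a multi-character RS are GNU awk extensions. With a strictly POSIX awk, a rough sketch of the same idea (assuming every record ends with a line containing only END) is to buffer each record and test it yourself:
awk '
  { rec = rec $0 "\n" }
  /^END$/ {
    if (rec ~ /(^|[^[:alnum:]_])data1([^[:alnum:]_]|$)/) printf "%s", rec
    rec = ""
  }
' file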

Awk multiplication gives zero

I am a bit new to using awk. My goal is to create a bash function of the form:
myfunction file column value
That takes the given column number in file, multiplies it by value and rewrites the file. For now I have written the following:
function multiply_column {
    file=$1
    column=$2
    value=$3
    awk -F" " '{print $col*mul}' col=$column mul=$value $file
}
My file looks like this:
0.400000E+15 0.168933E+00 -0.180294E-44 0.168933E+00
0.401000E+15 0.167689E+00 -0.181383E-44 0.167689E+00
0.402000E+15 0.166502E+00 -0.182475E-44 0.166502E+00
0.403000E+15 0.165371E+00 -0.183569E-44 0.165371E+00
0.404000E+15 0.164298E+00 -0.184666E-44 0.164298E+00
0.405000E+15 0.163284E+00 -0.185766E-44 0.163284E+00
0.406000E+15 0.162328E+00 -0.186868E-44 0.162328E+00
0.407000E+15 0.161431E+00 -0.187972E-44 0.161431E+00
0.408000E+15 0.160593E+00 -0.189080E-44 0.160593E+00
0.409000E+15 0.159816E+00 -0.190189E-44 0.159816E+00
0.410000E+15 0.159099E+00 -0.191302E-44 0.159099E+00
0.411000E+15 0.158442E+00 -0.192416E-44 0.158442E+00
0.412000E+15 0.157847E+00 -0.193534E-44 0.157847E+00
0.413000E+15 0.157312E+00 -0.194653E-44 0.157312E+00
0.414000E+15 0.156840E+00 -0.195775E-44 0.156840E+00
0.415000E+15 0.156429E+00 -0.196899E-44 0.156429E+00
0.416000E+15 0.156081E+00 -0.198026E-44 0.156081E+00
0.417000E+15 0.155796E+00 -0.199154E-44 0.155796E+00
0.418000E+15 0.155573E+00 -0.200285E-44 0.155573E+00
0.419000E+15 0.155413E+00 -0.201418E-44 0.155413E+00
0.420000E+15 0.155318E+00 -0.202554E-44 0.155318E+00
0.421000E+15 0.155285E+00 -0.203691E-44 0.155285E+00
0.422000E+15 0.155318E+00 -0.204831E-44 0.155318E+00
0.423000E+15 0.155414E+00 -0.205973E-44 0.155414E+00
0.424000E+15 0.155575E+00 -0.207116E-44 0.155575E+00
0.425000E+15 0.155802E+00 -0.208262E-44 0.155802E+00
I managed to just print the first column, but when I multiply it by my value, awk gives me 0. I tried my function with other files where the data was formatted differently, and it worked perfectly. I also tried to combine it with bc, without any success.
Does anyone see why awk gives 0 in this case?
Thanks in advance!
######### EDIT
I just found out that if my data file uses commas and not dots (i.e. 0,400000E+15 instead of 0.400000E+15), my function works fine. So somehow, somewhere, something is configured to treat the comma as the decimal separator instead of the dot. Does that ring a bell to anyone?
Set LC_ALL=C before executing your script to get the most commonly expected behavior for this and other locale-dependent issues; see http://www.gnu.org/software/gawk/manual/gawk.html#Locales. Also, don't pointlessly set FS to its default value, do quote your shell variables (google that if you don't know why), and do pass your variables with -v, the form that produces the most intuitive results (see http://cfajohnson.com/shell/cus-faq-2.html#Q24):
LC_ALL=C awk -v col="$column" -v mul="$value" '{print $col*mul}' "$file"
Read the book Effective Awk programming, 4th Edition, by Arnold Robbins.
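Putting that together, here is a sketch of the full function, assuming you also want the result written back into the file (the reading of "rewrites the file", and the .tmp name, are my assumptions):
multiply_column() {
    local file=$1 column=$2 value=$3
    # note: reassigning $col makes awk rebuild the line with single spaces between fields
    LC_ALL=C awk -v col="$column" -v mul="$value" '{ $col = $col * mul; print }' "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}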
There is a mismatch between the locale used to create the data file and your current one.
For example, the French locale and similar ones use the comma as their decimal separator, while the dot is the most widely used and is also the POSIX default.
If you want commas to be accepted as decimal separators, you might work around the issue like this:
LC_NUMERIC=fr_FR.UTF-8 awk '{print $col*mul}' col="$column" mul="$value" "$file"
Note that this won't work as is with GNU awk which doesn't honor the numeric locale setting by default. You would need to use the --use-lc-numeric flag to override.
Alternatively, if you want dots to be accepted as decimal separators but your current locale uses commas and you are not using GNU awk, you can run this:
LC_NUMERIC=C awk '{print $col*mul}' col="$column" mul="$value" "$file"
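A quick way to see the effect (assuming the fr_FR.UTF-8 locale is installed; --use-lc-numeric is GNU awk only):
echo '0.400000E+15' | LC_ALL=C awk '{print $1 * 2}'                               # 8e+14
echo '0.400000E+15' | LC_ALL=fr_FR.UTF-8 gawk --use-lc-numeric '{print $1 * 2}'   # 0, the dot is not a decimal separator here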

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces in the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, there are about 5 random rows that this doesn't work on, even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column on original.csv vs final.csv respectively;
2212026837 2212026837
2256 41688 6 2256416886
2076113566 2076113566
2009 84517 7 2009845177
2067950476 2067950476
2057 90531 5 2057 90531 5
2085271676 2085271676
2095183426 2095183426
2347366235 2347366235
2200160434 2200160434
2229359595 2229359595
2045373466 2045373466
2053849895 2053849895
2300 81552 3 2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
If that solves your problem, great! You got lucky, move on. :)
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
import csv, fileinput as fi, re, sys
writer = csv.writer(sys.stdout, delimiter='|', lineterminator='\n')
for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    writer.writerow(row)                   # re-quotes any field that contains a '|'
Save the above in a file like colfixer.py and run it with python3 colfixer.py original.csv >final.csv.
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
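To check which of the two explanations applies, a quick diagnostic (treating the file as plain pipe-separated, the same way your awk command does) is to print the record number of every line whose 26th field still contains whitespace:
awk -F'|' '$26 ~ /[[:space:]]/ { print NR ": " $26 }' final.csv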
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
n = split($26, a, /[[:space:]]+/)
$26=a[1]
for(i=2; i<=n; i++)
$26=$26""a[i]
}1' original.csv > final.csv

Create CSV from specific columns in another CSV using shell scripting

I have a CSV file with several thousand lines, and I need to take some of the columns in that file to create another CSV file to use for import to a database.
I'm not in shape with shell scripting anymore; can anyone point me in the right direction?
I have a bash script that reads the source file, but when I try to print the columns I want to a new file, it just doesn't work.
while IFS=, read symbol tr_ven tr_date sec_type sec_name name
do
    echo "$name,$name,$symbol" >> output.csv
done < test.csv
Above is the code I have. Out of the 6 columns in the original file, I want to build a CSV with "column6, column6, column1".
The test CSV file is like this:
Symbol,Trading Venue,Trading Date,Security Type,Security Name,Company Name
AAAIF,Grey Market,22/01/2015,Fund,,Alternative Investment Trust
AAALF,Grey Market,22/01/2015,Ordinary Shares,,Aareal Bank AG
AAARF,Grey Market,22/01/2015,Ordinary Shares,,Aluar Aluminio Argentino S.A.I.C.
What am I doing wrong with my script? Or, is there an easier - and faster - way of doing this?
Edit
These are the real headers:
Symbol,US Trading Venue,Trading Date,OTC Tier,Caveat Emptor,Security Type,Security Class,Security Name,REG_SHO,Rule_3210,Country of Domicile,Company Name
I'm trying to get the last column, which is number 12, but it always comes up empty.
The snippet looks and works fine to me; maybe you have some weird characters in the file, or it is coming from a DOS environment (use dos2unix to "clean" it!). Also, you can make use of read -r to prevent strange behaviour with backslashes.
But let's see how awk can solve this even faster:
awk 'BEGIN{FS=OFS=","} {print $6,$6,$1}' test.csv >> output.csv
Explanation
BEGIN{FS=OFS=","} this sets the input and output field separators to the comma. Alternatively, you can say -F=",", -F, or pass it as a variable with -v FS=",". The same applies for OFS.
{print $6,$6,$1} prints the 6th field twice and then the 1st one. Note that with print, every comma-separated parameter you give is joined using the OFS that was previously set; here, a comma.
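For the real 12-column header from your edit, the same approach applies. If the last column keeps coming up empty, the usual culprit is a DOS-formatted file whose trailing \r gets glued to the last field (hence the dos2unix suggestion above); a sketch that strips it on the fly, assuming CRLF line endings are indeed the issue:
awk 'BEGIN{FS=OFS=","} { sub(/\r$/, ""); print $12, $12, $1 }' test.csv >> output.csv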
