I have a script of my own wherein I want to include a command that would copy header of a csv file into a new csv file. I don't know how to accomplish this task.
Thanks in advance.
CSV files can contains elements, that contains linefeeds. In this case, the first logical CSV line is spread over 2 ore more physical lines. So physical line based tools like head, tail, grep awk or sed can't be used.
Example for a 1 line CSV line with 2 columns:
"first
column","second
column"
You have to use tools that support CSV analysis like python or php. Here is a php example:
php -r 'fputcsv(STDOUT,fgetcsv(STDIN));'
It reads from stdin and writes the first logic line to stdout.
Usage examples:
php -r 'fputcsv(STDOUT,fgetcsv(STDIN));' <old.csv >new.csv
any_command | php -r 'fputcsv(STDOUT,fgetcsv(STDIN));' >new.csv
$ head -1 old.csv > new.csv
Related
I have a text file and want to convert it to csv file before to convert it, i want to add a header to text file so that the csv file has the same header. I have one thousand columns in text file and want to have one thousand column name. As a side note, the content of the text file is just rows of some numbers which is separated by comma ",". Is there any way to add the header line in bash?
I tried the way below and didn't work. I did the command below first in python.
> for i in range(1001):
> print "col" + "_" + "i"
save the output of this in text file with this command (python header.py >> header.txt) and add the output of this in format of text file to the original text file that i have like below:
cat header.txt filename.txt > newfilename.txt
then convert the txt file to csv file with "mv newfilename.txt newfilename.csv".
But unfortunately this way doesn't work as the header line has double number of other rows for some reason. I would appreciate any help to make this problem solve.
based on the description your file is already comma separated, so is a csv file. You just want to add a column number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", $i,(i==NF?ORS:FS)}1' file
will add column headers as many as the fields in the first row of the file
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..10}; do
echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header /tmp/data > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.
How to split a large csv file (~100GB) and preserve the header in each part ?
For instance
h1 h2
a aa
b bb
into
h1 h2
a aa
and
h1 h2
b bb
First you need to separate the header and the content :
header=$(head -1 $file)
data=$(tail -n +2 $file)
Then you want to split the data
echo $data | split [options...] -
In the options you have to specify the size of the chunks and the pattern for the name of the resulting files. The trailing - must not be removed as it specifies split to read data from stdin.
Then you can insert the header at the top of each file
sed -i "1i$header" $splitOutputFile
You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
I found any previous solutions to this to not work properly on the mac systems that my script was targeting (why Apple? why?) I eventually ended up with a printf option that worked out pretty good as a proof of concept. I'm going to enhance this by putting the temporary files into a ramdisk and the like to improve performance since it is putting a bunch on disk as is and will probably be slow.
#!/bin/sh
# Pass a file in as the first argument on the command line (note, not secure)
file=$1
# Get the header file out
header=$(head -1 $file)
# Separate the data from the header
tail -n +2 $file > output.data
# Split the data into 1000 lines per file (change as you wish)
split -l 1000 output.data output
# Append the header back into each file from split
for part in `ls -1 output*`
do
printf "%s\n%s" "$header" "`cat $part`" > $part
done
you may download a freeware CsvSplitter from here. It is a zip from the website that contains a simple portable .exe file and a .txt file, necessary to work along with the executable, just extract the content in some directory and you're ready to work:
and it can split the file as can be seen in this picture
Everything is self-explanatory but more details can be found here
I have a number of .csv files all with the same structure of 22 columns. I only require columns 5,14 and 15 so use:
$ cut -d, -f5,14,1 original.csv > new_original.csv
However I will soon have a number of csv coming in daily and need to use a loop function to perform this on each csv file, and add a prefix "new_"for example to the file name. Alternatively I don't mind -i editing in place.
Thanks
You can run the following in the directory that contains the csv files.
for file in *.csv
do
cut -d, -f5,14,1 "$file" > "new_$file.csv"
done
This will loop over each of them, perform the filtering and output to the same name prefixed with new_.
i want to parse a csv file in a shell script. I want to input the name of the file at the prompt. like :
somescript.sh filename
Can it be done?
Also, I will read user input to display a particular a particular data number in the csv.
For example, say the csv file has 10 values in each line:
1,2,3,4,5,6,7,8,9,10
And I want to read the 5th value. how can I do it?
And multiple lines are involved.
Thanks.
If your file is really in such a simple format (just commas, no spaces), then cut -d, -f5 would do the trick.
#!/bin/sh
awk -F, "NR==$2{print \$$3}" "$1"
Usage:
./test.sh FILENAME LINE_NUMBER FIELD_NUMBER
I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39mb. My original attempt at this looked like this:
while read line ; do
name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about 1/2 of the file had been processed. I have another file that could potentially be up to 250 mb in size and I can't imagine how long that would take.
What I'd like to do is pull out the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't now how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.
The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxilliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
grep '"GVA"' $file >GVA.txt
grep '"HVBL"' $file >HVBL.txt
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0