how to cut columns of csv - shell

I have a set of csv files (around 250), each having 300 to 500 records. I need to cut 2 or 3 columns from each file and store it to another one. I'm using ubuntu OS. Is there any way to do it in command or utility?

If you know that the column delimiter does not occur inside the fields, you can use cut.
$ cat in.csv
foo,bar,baz
qux,quux,quuux
$ cut -d, -f2,3 < in.csv
bar,baz
quux,quuux
You can use the shell buildin 'for' to loop over all input files.

If the fields might contain the delimiter, you ought to find a library that can parse CSV files. Typically, general purpose scripting languages will include a CSV module in their standard library.
Ruby: require 'csv'
Python: import csv
Perl: use Text::ParseWords;

If your fields contain commas or newlines, you can use a helper program I wrote to allow cut (and other UNIX text processing tools) to properly work with the data.
https://github.com/dbro/csvquote
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse the cut program. Then they get restored after cut is done.
lutz' solution would become:
csvquote in.csv | cut -d, -f2,3 | csvquote -u

If you used ssconvert to get the CSV you might try:
ssconvert -O 'separator="|"' "file.xls" "file.txt"
Notice the TXT extension instead CSV, this way will use Gnumeric_stf:stf_assistant exporter instead of Gnumeric_stf:stf_csv, which let you use options (-O parameter). Otherwise you'll get a The file saver does not take options error. Pipe character is much more unlikely, but you might want to check before.
Then you can rename it and do things like:
cat file.csv | cut -d "|" -f3 | sort | uniq -c | sort -rn | head
Other options example: -O 'eol=unix separator=; format=preserve charset=UTF-8 locale=en_US transliterate-mode=transliterate quoting-mode=never'.
A solution with AWK v4+.
ssconvert man page.

Related

How to add a header to text file in bash?

I have a text file and want to convert it to csv file before to convert it, i want to add a header to text file so that the csv file has the same header. I have one thousand columns in text file and want to have one thousand column name. As a side note, the content of the text file is just rows of some numbers which is separated by comma ",". Is there any way to add the header line in bash?
I tried the way below and didn't work. I did the command below first in python.
> for i in range(1001):
> print "col" + "_" + "i"
save the output of this in text file with this command (python header.py >> header.txt) and add the output of this in format of text file to the original text file that i have like below:
cat header.txt filename.txt > newfilename.txt
then convert the txt file to csv file with "mv newfilename.txt newfilename.csv".
But unfortunately this way doesn't work as the header line has double number of other rows for some reason. I would appreciate any help to make this problem solve.
based on the description your file is already comma separated, so is a csv file. You just want to add a column number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", $i,(i==NF?ORS:FS)}1' file
will add column headers as many as the fields in the first row of the file
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..10}; do
echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little minor help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# strip the trailing comma
columns="${columns%,*}"
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header /tmp/data > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.

Text Processing - how to remove part of string from search results using sed?

I am parsing through .xml files looking for names that are inside HTML tags.
I have found what I need, but I would just like to keep the family names.
This is what I have until now (grep command for the names + clean-up of the result, which includes removing the tags and the file name, I will later sort them and leave only unique names):
grep -oP '<name>([A-ZÖÄÜÕŽS][a-zöäüõžš]*)[\s-]([A-ZÖÄÜÕŽS][a-zöäüõžš]*)</name>' *.xml --colour | sed -e 's/<[^>]*>//g' | sed 's/la[0-9]*//' | sed 's/$*.xml://'
The output looks like this:
Mart Kreos
Hans Väär
Karel Väär
Jaan Tibbin
Jüri Kull
I would like to keep the family names, but remove the first names.
I tried to use the following command, but it only worked for some names and not for the others:
sed -r 's/([A-ZÖÄÜÕŽŠ][a-zöäüõžš]+[ ])([A-ZÖÄÜÕŽS][a-zöäüõžš]+)/\2/g'
You should use cut. It is more adapted to what you're trying to achieve here. And you would avoid struggling with UTF-8 characters.
This would give you the expected result for all names in your sample output:
cut -d ' ' -f 2

Can I use grep to extract a single column of a CSV file?

I'm trying to solve o problem I have to do as soon as possible.
I have a csv file, fields separated by ;.
I'm asked to make a shell command using grep to list only the third column, using regex. I can't use cut. It is an exercise.
My file is like this:
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
2;Wayne;Watkins;22;Lanme Place;Cotoiwi;NC;86578
3;Danny;Vega;25;Fofci Center;Momahbih;MS;21027
4;Larry;Robinson;23;Bammek Boulevard;Gaizatoh;NE;27517
5;Myrtie;Black;20;Savon Square;Gokubpat;PA;92219
6;Nellie;Greene;23;Utebu Plaza;Rotvezri;VA;17526
7;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
8;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
9;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
10;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
11;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
12;Stanley;Tucker;54;Cure View;Woocabu;OH;45475
13;Lina;Holloway;41;Sajric River;Furutwe;ME;62184
14;Hettie;Carlson;57;Zuheho Pike;Gokrobo;PA;89098
15;Maud;Phelps;57;Lafni Drive;Gokemu;MD;87066
16;Della;Roberson;53;Zafe Glen;Celoshuv;WV;56749
17;Cory;Roberson;56;Riltav Manor;Uwsupep;LA;07983
18;Stella;Hayes;30;Omki Square;Figjitu;GA;35813
19;Robert;Griffin;22;Kiroc Road;Wiregu;OH;39594
20;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
21;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
22;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
23;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
24;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
I think I should use something like: cat < test.csv | grep 'regex'.
Thanks.
Right Tools For The Job: Using awk or cut
Assuming you want to match the third column against a specific field:
awk -F';' '$3 ~ /Foo/ { print $0 }' file.txt
...will print any line where the third field contains Foo. (Changing print $0 to print $3 would print only that third field).
If you just want to print the third column regardless, use cut: cut -d';' -f3 <file.txt
Wrong Tool For The Job: Using GNU grep
On a system where grep has the -o option, you can chain two instances together -- one to trim everything after the fourth column (and remove lines with less than four columns), another to take only the last remaining column (thus, the fourth):
str='foo;bar;baz;qux;meh;whatever'
grep -Eo '^[^;]*[;][^;]*[;][^;]*[;][^;]*' <<<"$str" \
| grep -Eo '[^;]+$'
To explain how that works:
^, outside of square brackets, matches only at the beginning of a line.
[^;]* matches any character except ; zero-or-more times.
[;] matches only the character ;.
...thus, each [^;]*[;] in the regex matches a single field, whether or not that field contains text. Putting four of those in the first stage means we're matching only fields, and grep -o tells grep to only emit content it was successfully able to match.
If you just need the 3rd field and it's always properly delimited with ';' why not use 'cut'?
cut -d';' -f3 <filename>
UPDATED:
OP wasn't clear, maybe only want to look at the 3rd line?
head -3 <filename> | tail -1
OR.. Maybe just getting of list of the things that appear in the 3rd field?
Not clear what the intended use of 'grep' would be??
cut -d';' -f3 <filename> | sort -u
As the other answers have said, using grep is a bad/unfortunate idea.
The only way I can think of using grep is to pull out a specific row where the 3rd column == some value. E.g.,
grep '^\([^;]*;\)\{2\}Bell;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Or if the first column is the index (not counting it as a column):
grep '^\([^;]*;\)\{3\}39;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Even using grep in this case leads to a pretty ugly solution.
Edit: Didn't see Charles Duffy's answer... that's pretty clever.

My .gz/.zip file contains a huge text file; without saving that file unpacked to disk, how to extract its lines that match a regular expression?

I have a file.gz (not a .tar.gz!) or file.zip file. It contains one file (20GB-sized text file with tens of millions of lines) named 1.txt.
Without saving 1.txt to disk as a whole (this requirement is the same as in my previous question), I want to extract all its lines that match some regular expression and don't match another regex.
The resulting .txt files must not exceed a predefined limit, say, one million lines.
That is, if there are 3.5M lines in 1.txt that match those conditions, I want to get 4 output files: part1.txt, part2.txt, part3.txt, part4.txt (the latter will contain 500K lines), that's all.
I tried to make use of something like
gzip -c path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But the above code doesn't work. Maybe Bash can do it, as in my previous question, but I don't know how.
You can perhaps use zgrep.
zgrep [ grep_options ] [ -e ] pattern filename.gz ...
NOTE: zgrep is a wrapper script (installed with gzip package), which essentially uses the same command internally as mentioned in other answers.
However, it looks more readable in the script & easier to write the command manually.
I'm afraid It's imposible, quote from gzip man:
If you wish to create a single archive file with multiple members so
that members can later be extracted independently, use an archiver
such as tar or zip.
UPDATE: After de edit, if the gz only contains one file , a one step tool like awk shoul be fine:
gzip -cd path/to/test/file.gz | awk 'BEGIN{global=1}/my regex/{count+=1;print $0 >"part"global".txt";if (count==1000000){count=0;global+=1}}'
split is also a good choice but you will have to rename files after it.
Your solution is almost good. The problem is that You should specify for gzip what to do. To decompress use -d. So try:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -l1000000
But with this you will have a bunch of files like xaa, xab, xac, ... I suggest to use the PREFIX and numeric suffixes features to create better output:
gzip -dc path/to/test/file.gz | grep -P --regexp='my regex' | split -dl1000000 - file
In this case the result files will look like: file01, file02, fil03 etc.
If You want to filter out some not matching perl style regex, you can try something like this:
gzip -dc path/to/test/file.gz | grep -P 'my regex' | grep -vP 'other regex' | split -dl1000000 - file
I hope this helps.

Extract Data from CSV in shell script (Sed, AWK, Grep?)

I need to extract some data from a CSV file. The CSV is a 2 column file with multiple records. The first column is the date, the second column is the data that needs to be extracted. The first row of the CSV file is the column headers, so it can be skipped. And I've already created the column header for the extracted data's csv file, so theres no need for that, I'll simply use >> to import the data into it.
Here is 1 record/line (of many) in the CSV file:
"2009-09-20 00:12:37","a:2:{s:15:""info_buyRequest"";a:5:{s:4:""uenc"";s:116:""aHR0cDovL3N0b3JlLmZvcmdldGhhbmdvdmVycy5jb20vcGF0Y2hlcy9pbmRpdmlkdWFsLXBhdGNoZXMvZnJlZS1zYW1wbGUuaHRtbD9fX19TSUQ9VQ,,"";s:7:""product"";s:1:""1"";s:15:""related_product"";s:0:"""";s:7:""options"";a:13:{i:17;s:2:""59"";i:16;s:2:""50"";i:15;s:2:""49"";i:14;s:2:""47"";i:13;s:2:""41"";i:12;s:2:""34"";i:11;s:2:""25"";i:10;s:2:""23"";i:9;s:2:""19"";i:8;s:2:""17"";i:7;s:2:""12"";i:6;s:1:""9"";i:5;s:1:""5"";}s:3:""qty"";i:1;}s:7:""options"";a:13:{i:0;a:7:{s:5:""label"";s:25:""How did you hear about us"";s:5:""value"";s:22:""Friend / Family Member"";s:11:""print_value"";s:22:""Friend / Family Member"";s:9:""option_id"";s:2:""17"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""59"";s:11:""custom_view"";b:0;}i:1;a:7:{s:5:""label"";s:3:""Age"";s:5:""value"";s:5:""21-24"";s:11:""print_value"";s:5:""21-24"";s:9:""option_id"";s:2:""16"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""50"";s:11:""custom_view"";b:0;}i:2;a:7:{s:5:""label"";s:14:""Marital Status"";s:5:""value"";s:9:""UnMarried"";s:11:""print_value"";s:9:""UnMarried"";s:9:""option_id"";s:2:""15"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""49"";s:11:""custom_view"";b:0;}i:3;a:7:{s:5:""label"";s:3:""Sex"";s:5:""value"";s:6:""Female"";s:11:""print_value"";s:6:""Female"";s:9:""option_id"";s:2:""14"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""47"";s:11:""custom_view"";b:0;}i:4;a:7:{s:5:""label"";s:10:""Occupation"";s:5:""value"";s:7:""Student"";s:11:""print_value"";s:7:""Student"";s:9:""option_id"";s:2:""13"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""41"";s:11:""custom_view"";b:0;}i:5;a:7:{s:5:""label"";s:9:""Education"";s:5:""value"";s:16:""College Graduate"";s:11:""print_value"";s:16:""College Graduate"";s:9:""option_id"";s:2:""12"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""34"";s:11:""custom_view"";b:0;}i:6;a:7:{s:5:""label"";s:16:""Household Income"";s:5:""value"";s:7:""30K-50K"";s:11:""print_value"";s:7:""30K-50K"";s:9:""option_id"";s:2:""11"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""25"";s:11:""custom_view"";b:0;}i:7;a:7:{s:5:""label"";s:23:""Do You Take Supplements"";s:5:""value"";s:2:""No"";s:11:""print_value"";s:2:""No"";s:9:""option_id"";s:2:""10"";s:11:""option_type"";s:5:""radio"";s:12:""option_value"";s:2:""23"";s:11:""custom_view"";b:0;}i:8;a:7:{s:5:""label"";s:40:""How would you rank your typical hangover"";s:5:""value"";s:4:""Mild"";s:11:""print_value"";s:4:""Mild"";s:9:""option_id"";s:1:""9"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""19"";s:11:""custom_view"";b:0;}i:9;a:7:{s:5:""label"";s:51:""What type of establishments do you typically prefer"";s:5:""value"";s:10:""Nightclubs"";s:11:""print_value"";s:10:""Nightclubs"";s:9:""option_id"";s:1:""8"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""17"";s:11:""custom_view"";b:0;}i:10;a:7:{s:5:""label"";s:40:""How often do you usually go out per week"";s:5:""value"";s:3:""1-2"";s:11:""print_value"";s:3:""1-2"";s:9:""option_id"";s:1:""7"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:2:""12"";s:11:""custom_view"";b:0;}i:11;a:7:{s:5:""label"";s:49:""How many drinks do you typically consume per week"";s:5:""value"";s:3:""6-8"";s:11:""print_value"";s:3:""6-8"";s:9:""option_id"";s:1:""6"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""9"";s:11:""custom_view"";b:0;}i:12;a:7:{s:5:""label"";s:53:""How would you prefer to buy our Products"";s:5:""value"";s:6:""Online"";s:11:""print_value"";s:6:""Online"";s:9:""option_id"";s:1:""5"";s:11:""option_type"";s:9:""drop_down"";s:12:""option_value"";s:1:""5"";s:11:""custom_view"";b:0;}}}"
The Output should be the data found here:
""print_value";s:?:""{DATA}""
Were the ? is a number, and {DATA} is the data being extracted.
So the output for example of this 1 record would be:
"2009-09-20 00:12:37","Friend / Family Member","21-24","UnMarried","Female","Student","College Graduate","30K-50K","No","Mild","Nightclubs","1-2","6-8","Online"
I am not proficient in Sed,AWK, or Grep, but I know it can be done using one of these tools if not all three. Any help or nudges in the right direction would be GREATLY appreciated.
I suggest you use PHP to de-serialize the structure.
However, here's a quick and dirty version of what you want using sed and tr. Certainly you can do this much much better:
cat file.csv | \
tr ",;" "\n" | \
sed -e 's/[asbi]:[0-9]*[:]*//g' -e '/^[{}]/d' -e 's/""//g' -e '/^"{/d' | \
sed -n -e '/^"/p' -e '/^print_value$/,/^option_id$/p' | \
sed -e '/^option_id/d' -e '/^print_value/d' -e 's/^"\(.*\)"$/\1/' | \
tr "\n" "," | \
sed -e 's/,\([0-9]*-[0-9]*-[0-9]*\)/\n\1/g' -e 's/,$//' | \
sed -e 's/^/"/g' -e 's/$/"/g' -e 's/,/","/g'
The explanation:
split by commas and semicolons
remove remove the php structure syntax s:X:Y, b:X, ... and remove lines starting with { or } or "{
extract the section from print_value to the next option_id, also keep the date (line start with ")
remove those labels (print and option), and remove quotations around the date
concat all lines with commas
seperate lines (starting with date pattern), and remove extra comma at end
add quotations around all fields
Wow, I know it's embarrassing :)
Here is my anwser:
cat TestData \
| grep -o -P "print_value\"\";.*?:\"\".*?\"\";" \
| perl -pe 's|print_value.*:\"\"(.*?)\"\";|\1|'
The first line show the data (stored in TestData).
The second line asks grep to separate each match from print_value to the nearest '"";'.
Notice that I use '.*?' for non greedy match (needs to use '-P' with it).
The last line use perl to strip all un-needed. See that I use '(.*?)' to match the needed group and use '\1' to show the group.
Hope this helps.
Here's a sed oneliner:
sed -nr 's/^([^,]+),(.*)$/\2#%#\1/;:a;s/""print_value"";s:[0-9]+:""([^"]+)""(.*)$/\2,"\1"/;ta;s/^.*#%#//p' <source
Basically extract the data and append it to the end of the line using a unique delimiter '#%#'.
When the loop/substitute construct fails (i.e. no more data), throw away what is left of the original line leaving the data nicely formatted.

Resources