I have a giant CSV file (6 GB) whose rows look like this:
"87687","institute Polytechnic, Brazil"
"342424","university of India, India"
"24343","univefrsity columbia, Bogata, Colombia"
and I would like to remove all punctuation and lowercase the second column, yielding:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
what would be the most efficient way to do this on the terminal?
Tried:
cat TEXTFILE | tr -d '[:punct:]' > OUTFILE
problem: the result is not in lowercase, and tr acts on both columns, not just the second.
Here is the robust/reliable way, with a real CSV parser in Perl, using just one process.
Since it works line by line, the 6 GB file size should not be an issue.
#!/usr/bin/perl
use strict; use warnings;                    # safety harness
use Text::CSV;                               # load the needed module (install it)
use feature qw/say/;                         # say = print("...\n")
# create an instance of a new CSV parser
my $csv = Text::CSV->new({ auto_diag => 1 });
# open a file handle or exit with an error
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline($fh)) {       # parse line by line
    $_ = $row->[1];                          # work on column 2 only
    s/[\s[:punct:]]//g;                      # remove both spaces and punctuation
    $row->[1] = lc $_;                       # lowercase and store back
    say join ",", map { qq/"$_"/ } @$row;    # re-quote each field and print the row
}
close $fh;                                   # close the file handle
Output
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
install
cpan Text::CSV
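To run it on the real data, you might save the script as, say, clean_csv.pl (a name used here just for illustration), point the open line at your 6 GB file, and redirect stdout:
$ perl clean_csv.pl > cleaned.csv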
Here's an approach using xsv and process substitution:
paste -d, \
<(xsv select 1 infile.csv) \
<(xsv select 2 infile.csv | sed 's/[[:blank:][:punct:]]*//g;s/.*/\L&/')
The sed command first removes all blanks and punctuation, then lowercases the entire match.
This also works when the first field contains blanks and commas, and retains quoting where required.
Using sed
$ sed -E ':a;s/([^,]*,)([^ ,]*)[ ,]([[:alpha:]]+)/\1\L\2\3/;ta' input_file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia
I suggest using this awk solution, which should work with any version of awk:
awk 'BEGIN{FS=OFS="\",\""} {
gsub(/[^[:alnum:]"]+/, "", $2); $2 = tolower($2)} 1' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Details:
We make "," the input and output field separator in the BEGIN block
gsub(/[^[:alnum:]"]+/, "", $2): Strip all non-alphanumeric characters except "
$2 = tolower($2): Lowercase second column
One GNU awk (for gensub()) idea:
awk '
BEGIN { FS=OFS="\"" }
{ $4=gensub(/[^[:alnum:]]/,"","g",tolower($4)) }
1'
This generates:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Another sed approach -
sed -E 's/ +//g; s/([^"]),/\1/g; s/"([^"]*)"/"\L\1"/g' file
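On the sample data this yields:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"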
I don't like that this leaves no flexibility, though: if you find something else you want to remove, you have to rewrite the logic.
Another in awk -
awk -F'[", ]+' '
{ printf "\"%s\",\"", $2;
for(c=3;c<=NF;c++) printf "%s", tolower($c);
print "\"";
}' file
This approach lets you define and add any additional offending characters into the field delimiters without editing your logic.
$: pat=$"[\"',_;:!##\$%)(* -]+"
$: echo "$pat"
["',_;:!##$%)(* -]+
$: cat file
"87687","institute 'Polytechnic, Brazil"
"342424","university; of-India, India"
"24343","univefrsity )columbia, Bogata, Colombia"
$: awk -F"$pat" '{printf "\"%s\",\"", $2; for(c=3;c<=NF;c++) printf "%s", tolower($c); print "\"" }' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
(I hate the way that lone single quote throws the markup color/format parsing off, lol)
Another way, using Ruby. I edited the data to show that only the second field is modified.
% ruby -r 'csv' -e 'f = open("file");
CSV.parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Using FastCSV gives a huge speedup
gem install fastcsv
% ruby -r 'fastcsv' -e 'f = open("file");
FastCSV.raw_parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Data
% cat file
"8768, 7","institute Polytechnic, Brazil"
"342 424","university of India, India"
"243 43","univefrsity columbia, Bogata, Colombia"
With your shown samples and attempts, please try the following GNU awk code, which uses its match() function. The regex (^"[^"]*",")([^"]*)(".*)$ creates three capturing groups whose values are stored in the array arr; those values are then used later in the program to produce the required output.
awk '
match($0,/(^"[^"]*",")([^"]*)(".*)$/,arr){
gsub(/[^[:alnum:]]+/,"",arr[2])
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
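Run on the sample input, this prints:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"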
This might work for you (GNU sed):
sed -E 's/("[^"]*",)/\1\n/;h;s/.*\n//;s/[[:punct:] ]//g;s/.*/"\L&"/;H;g;s/\n.*\n//' file
Divide and rule:
partition the line into two fields, make a copy, process the second field (removing punctuation and spaces), re-quote and lowercase it, and then re-assemble the fields.
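Run on the sample file, this produces:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"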
An alternative, perhaps?
sed -E ':a;s/^("[^"]*",".*)[^[:alpha:]"](.*)/\L\1\2/;ta' file
Here is a way to do it in PHP.
Note: PHP outputs double quotes only when a field needs them, so the first column is quoted only if it requires quoting; the cleaned second column never will be, since it no longer contains spaces or special characters.
$max_line_length = 100;   // longest expected line; 0 would mean "no limit"
if (($fp = fopen("file.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($fp, $max_line_length, ",")) !== FALSE) {
        // strip whitespace and punctuation, then lowercase the second column
        $data[1] = strtolower(preg_replace('/[\s[:punct:]]/', '', $data[1]));
        fputcsv(STDOUT, $data, ',', '"');
    }
    fclose($fp);
}
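Given fputcsv's minimal quoting, the expected output for the sample data would look like this (a sketch):
87687,institutepolytechnicbrazil
342424,universityofindiaindia
24343,univefrsitycolumbiabogatacolombia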
I want to turn Unicode text into pure ASCII encoding using escape sequences.
Input: Ɏɇ衳 outputs to "\u024E\u0247\u8873"
Basically the opposite of this.
$ echo -e "\u024E\u0247\u8873"
Ɏɇ衳
I want the encoding to stay UTF-8; all I'm doing is changing form.
I've tried:
iconv -f utf8 -t utf8 $file
iconv -f utf8 -t utf16 $file
The codes you mention (024E, 0247, ...) are Unicode code points and are independent of UTF-8 or UTF-16.
If perl is your option, you can retrieve the codes with:
perl -C -ne 'map {printf "\\u%04X", ord} (/./g)' <<< "Ɏɇ衳"; echo
which outputs:
\u024E\u0247\u8873
Explanation
The perl code above is mostly equivalent to:
#!/usr/bin/perl
use utf8;
$str = "Ɏɇ衳";
foreach $chr ($str =~ /./g) {
printf "\\u%04X", ord($chr);
}
print "\n";
use utf8 declares that the script source (and therefore the embedded string literal) is encoded in UTF-8.
($str =~ /./g) breaks the string into a list of characters.
foreach iterates over the array of characters.
ord returns the code point of the given character.
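To see why the code point is independent of the byte encoding: U+024E is one abstract number, while its UTF-8 form is the two bytes C9 8E. A quick check, assuming xxd is available and a UTF-8 locale:
$ printf 'Ɏ' | xxd
00000000: c98e                                     ..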
EDIT
If you want to auto-scale the number of digits to accommodate out-of-BMP characters, try this instead:
#!/usr/bin/perl
use utf8;
$str = "Ɏɇ衳";
foreach $chr ($str =~ /./g) {
$n = ord($chr);
$d = $n > 0xffff ? 8 : 4;
printf "\\u%0${d}X", $n;
}
If you have the text in a file, you can convert between encodings with iconv:
iconv -f $input_encoding -t $output_encoding $file
Check man iconv for more details.
I would like some help with a substitution I want to do on the lines of a file that look like this:
aoipp;dadada.12312;ss;1245454;Xiop;12.12;45.3;47.897;31.5;
asdfafd;14355.54664;peasd;125.1;900.2;76.897;67.456;asdfdf;
perio;777.2;ipoes;900.34;2;1980.45;870.98;67.67;
I want to replace every . with , but only after the fifth occurrence of the delimiter ;. Everything else needs to remain unchanged. So the desired output file would look like this :
aoipp;dadada.12312;ss;1245454;Xiop;12,12;45,3;47,897;31,5;
asdfafd;14355.54664;peasd;125.1;900.2;76,897;67,456;asdfdf;
perio;777.2;ipoes;900.34;2;1980,45;870,98;67,67;
I'm interested in doing this primarily in Perl so I can incorporate it into a larger program, but any solutions in bash/awk are welcome as well. Thanks in advance.
This awk one-liner should work for you:
awk -F';' -v OFS=";" '{for(i=6;i<=NF;i++)gsub("[.]",",",$i)}7' file
It starts from the 6th field (; separated) and, for each field from there on, replaces every . with ,. (The trailing 7 is simply a true condition, which makes awk print each record.)
Test with your data:
kent$ cat f
aoipp;dadada.12312;ss;1245454;Xiop;12.12;45.3;47.897;31.5;
asdfafd;14355.54664;peasd;125.1;900.2;76.897;67.456;asdfdf;
perio;777.2;ipoes;900.34;2;1980.45;870.98;67.67;
kent$ awk -F';' -v OFS=";" '{for(i=6;i<=NF;i++)gsub("[.]",",",$i)}7' f
aoipp;dadada.12312;ss;1245454;Xiop;12,12;45,3;47,897;31,5;
asdfafd;14355.54664;peasd;125.1;900.2;76,897;67,456;asdfdf;
perio;777.2;ipoes;900.34;2;1980,45;870,98;67,67;
I used an array slice @fields[ 5 .. $#fields ] to access only the elements to be changed.
#!/usr/bin/perl
use warnings;
use strict;

my @input = qw( aoipp;dadada.12312;ss;1245454;Xiop;12.12;45.3;47.897;31.5;
                asdfafd;14355.54664;peasd;125.1;900.2;76.897;67.456;asdfdf;
                perio;777.2;ipoes;900.34;2;1980.45;870.98;67.67;
);
my @expected = qw( aoipp;dadada.12312;ss;1245454;Xiop;12,12;45,3;47,897;31,5;
                   asdfafd;14355.54664;peasd;125.1;900.2;76,897;67,456;asdfdf;
                   perio;777.2;ipoes;900.34;2;1980,45;870,98;67,67;
);

sub process {
    my (@input) = @_;
    my @output;
    for my $line (@input) {
        my @fields = split /;/, $line;
        s/\./,/ for @fields[ 5 .. $#fields ];
        push @output, join ';', @fields, q();
    }
    return \@output;
}

use Test::More tests => 1;
is_deeply(process(@input), \@expected);
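Running the script emits the usual TAP output confirming that the processed input matches the expectation:
1..1
ok 1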
while (my $line = <DATA>) {
    if ($line =~ /^(?:[^;]*;){5}/) {       # match up to and including the 5th ';'
        substr($line, $+[0]) =~ y/./,/;    # $+[0] = end of match; transliterate '.' to ',' in the rest, in place
    }
    print $line;
}
__DATA__
aoipp;dadada.12312;ss;1245454;Xiop;12.12;45.3;47.897;31.5;
asdfafd;14355.54664;peasd;125.1;900.2;76.897;67.456;asdfdf;
perio;777.2;ipoes;900.34;2;1980.45;870.98;67.67;
perl -pe 's/(.*?;){5}\K(.*)/$2 =~ s!\.!,!rg/ge' file
Skip everything up to and including the 5th ; ((.*?;){5}\K),
and apply the substitution . → , to the rest of the line ($2 =~ s!\.!,!rg).
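On the sample file this yields the desired output:
$ perl -pe 's/(.*?;){5}\K(.*)/$2 =~ s!\.!,!rg/ge' file
aoipp;dadada.12312;ss;1245454;Xiop;12,12;45,3;47,897;31,5;
asdfafd;14355.54664;peasd;125.1;900.2;76,897;67,456;asdfdf;
perio;777.2;ipoes;900.34;2;1980,45;870,98;67,67;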
# Note: this replaces the ';' delimiters themselves from the sixth occurrence onward,
# so it does not produce the requested output, as the result below shows.
sed -i 's/;/,/6g' filename
cat filename
aoipp;dadada.12312;ss;1245454;Xiop;12.12,45.3,47.897,31.5,
asdfafd;14355.54664;peasd;125.1;900.2;76.897,67.456,asdfdf,
perio;777.2;ipoes;900.34;2;1980.45,870.98,67.67,
The following
echo text | perl -lnE 'say "word: $_\t$_"'
prints
word: text text
I need
word: 'text' 'text'
Tried:
echo text | perl -lnE 'say "word: \'$_\' \'$_\'";' #didn't works
echo text | perl -lnE 'say "word: '$_' '$_'";' #neither
How to correctly escape the single quotes for bash?
Edit:
I want to prepare a shell script with a couple of mv lines (to check them before actually renaming the files). E.g., I tried to solve the following:
find . -type f -print | \
perl \
-MText::Unaccent::PurePerl=unac_string \
-MUnicode::Normalize=NFC -CASD \
-lanE 'BEGIN{$q=chr(39)}$o=$_;$_=unac_string(NFC($_));s/[{}()\[\]\s\|]+/_/g;say "mv $q$o$q $_"' >do_rename
e.g. from filenames like:
Somé filénamé ČŽ (1980) |Full |Movie| Streaming [360p] some.mp4
I want to get the following output in the file do_rename:
mv 'Somé filénamé ČŽ (1980) |Full |Movie| Streaming [360p] some.mp4' Some_filename_CZ_1980_Full_Movie_Streaming_360p_some.mp4
and after manual inspection I want to run:
bash do_rename
to perform the actual renames.
You can use ASCII code 39 for ' to avoid escape hell:
echo text | perl -lnE 'BEGIN{ $q=chr(39) } say "word: $q$_$q\t$q$_$q"'
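which prints:
word: 'text' 'text'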
You can use:
echo text | perl -lnE "say \"word: '\$_'\t'\$_'\""
word: 'text' 'text'
Bash allows you to include an escaped double quote inside a double-quoted string, but the same doesn't apply to single quotes. While doing so, we need to escape $ to prevent Bash from expanding it.
OK, based on the statement of the problem you're having, my suggestion would be: don't pipe find to perl; that's just asking for all kinds of annoyance.
I'm not entirely familiar with the modules, but would suggest you try something like this:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Text::Unaccent::PurePerl qw ( unac_string );
use Unicode::Normalize qw ( NFC );
use Getopt::Std;
use File::Copy qw ( move );
use Encode qw(decode_utf8);
my %opts;
#x to execute, p to specify a path.
#p mandatory.
getopts('xp:',\%opts);
#this sub is called for each file, by the find function.
#$File::Find::name is the full path to the file.
#$_ is just the filename.
sub rename_unicode_files {
#skip if it's not a file.
return unless -f $File::Find::name;
#convert name with functions from your example.
my $newname = unac_string(NFC(decode_utf8($File::Find::name)));
$newname =~ s/[{}()\[\]\s\|]+/_/g;
#could apply other transforms here, such as regular expressions.
#if the two names are different, consider moving.
unless ( $newname eq $File::Find::name ) {
print "Would rename: $File::Find::name to $newname\n";
#actually do it, if '-x' is specified.
if ( $opts{x} ) { move ( $File::Find::name, $newname ); };
}
}
#require -p <pathname> or otherwise print how to use.
unless ( -d $opts{p} ) {
print "Usage: $0 -p <pathname> [-x]\n";
exit;
}
#trigger find with callback to subroutine, over the '-p <path>'.
find ( \&rename_unicode_files, $opts{p} );
Extend via the Getopt::Std handling shown above to check whether a flag was given: run it normally and you get 'this is what I would do'; pass -x and it actually does it.
And either use the perl builtin rename or the move function available from File::Copy.
This will neatly avoid a lot of the escaping and interpolating problems you're having, and I think leave you with generally more readable and useful code.
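Usage might look like this (assuming the script is saved as rename_unicode.pl - a hypothetical name):
$ perl rename_unicode.pl -p /path/to/files      # dry run: prints what it would rename
$ perl rename_unicode.pl -p /path/to/files -x   # actually perform the renames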
Edit: Given a comment suggesting that the above is 'too long', how about:
#!/usr/bin/perl
use File::Find; use Text::Unaccent::PurePerl qw ( unac_string ); use Unicode::Normalize qw ( NFC ); find( sub { return unless -f $File::Find::name; print "mv \'$File::Find::name\' \'" . unac_string( NFC($File::Find::name) ) . "\'\n"; }, "." );
Still not convinced of the value of the approach. Even if it is only run occasionally - that's even more reason to make it as clear as possible.
There is a trivial solution using zsh shell and SQL-like (at least PostgreSQL and Oracle) quoting style:
$ setopt rc_quotes
$ echo text | perl -lnE 'say "word: ''$_''\t''$_''"'
word: 'text' 'text'
To quote a ' you simply double it and use '' in this mode.
I have a .h file containing, among other things, data in this format:
struct X[]{
{"Field", "value1 value2 value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd sdfdsf sdfs"};
/****************/
};
I have a text file containing the values that I want to replace in the .h file with new values:
value1 Valuesdfdsf1
value2 Value1dfsdf
value3 Value1_another
sfsd sfsd_ewew
sdfdsf sdfdsf_ew
sdfs sfsd_new
And the resulting .h file will contain the replacements from the text file above. Everything else remains the same.
struct X[]{
{"Field1", "value11 value12 value232"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd_ewew sdfdsf_ew sdfs_new"};
/****************/
};
Please help me come up with a solution to accomplish this using Unix tools: awk, perl, bash, sed, etc.
cat junk/n2.txt | perl -e '{use File::Slurp; my @r = File::Slurp::read_file("junk/n.txt"); my %r = map {chomp; (split(/\s+/,$_))[0,1]} @r; while (<>) { unless (/^\s*{"/) {print $_; next;}; my ($pre,$values,$post) = ($_ =~ /^(\s*{"[^"]+", ")([^"]+)(".*)$/); my @new_values = map { exists $r{$_} ? $r{$_}:$_ } split(/\s+/,$values); print $pre . join(" ",@new_values) . $post . "\n"; }}'
Result:
struct X[]{
{"Field", "value1 Value1dfsdf value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd_ewew sdfdsf_ew sfsd_new"};
/****************/
};
Code untangled:
use File::Slurp;
my @replacements = File::Slurp::read_file("junk/n.txt");
my %r = map {chomp; (split(/\s+/,$_))[0,1]} @replacements;
while (<>) {
    unless (/^\s*{"/) {print $_; next;}
    my ($pre,$values,$post) = ($_ =~ /^(\s*{"[^"]+", ")([^"]+)(".*)$/);
    my @new_values = map { exists $r{$_} ? $r{$_} : $_ } split(/\s+/, $values);
    print $pre . join(" ",@new_values) . $post . "\n";
}
#!/usr/bin/perl
use strict; use warnings;
# you need to populate %lookup from the text file
my %lookup = qw(
value1 Valuesdfdsf1
value2 Value1dfsdf
value3 Value1_another
sfsd sfsd_ewew
sdfdsf sdfdsf_ew
sdfs sfsd_new
);
while ( my $line = <DATA> ) {
if ( $line =~ /^struct \w+\Q[]/ ) {
print $line;
process_struct(\*DATA, \%lookup);
}
else {
print $line;
}
}
sub process_struct {
my ($fh, $lookup) = @_;
while (my $line = <$fh> ) {
unless ( $line =~ /^{"(\w+)", "([^"]+)"}([,;])\s+/ ) {
print $line;
return;
}
my ($f, $v, $p) = ($1, $2, $3);
$v =~ s/(\w+)/exists $lookup->{$1} ? $lookup->{$1} : $1/eg;
printf qq|{"%s", "%s"}%s\n|, $f, $v, $p;
}
return;
}
__DATA__
struct X[]{
{"Field", "value1 value2 value"},
{"Field2", "value11 value12 value232"},
{"Field3", "x y z"},
{"Field4", "a bbb s"},
{"Field5", "sfsd sdfdsf sdfs"};
/****************/
};
Here's a simple-looking program:
use strict;
use warnings;
use File::Copy;
use constant {
OLD_HEADER_FILE => "headerfile.h",
NEW_HEADER_FILE => "newheaderfile.h",
DATA_TEXT_FILE => "data.txt",
};
open (HEADER, "<", OLD_HEADER_FILE) or
    die qq(Can't open old header file ") . OLD_HEADER_FILE . qq(" for reading);
open (NEWHEADER, ">", NEW_HEADER_FILE) or
    die qq(Can't open new header file ") . NEW_HEADER_FILE . qq(" for writing);
open (DATA, "<", DATA_TEXT_FILE) or
    die qq(Can't open data file ") . DATA_TEXT_FILE . qq(" for reading);
#
# Put Replacement Data in a Hash
#
my %dataHash;
while (my $line = <DATA>) {
chomp($line);
my ($key, $value) = split (/\s+/, $line);
$dataHash{$key} = $value if ($key and $value);
}
close (DATA);
#
# NOW PARSE THOUGH HEADER
#
while (my $line = <HEADER>) {
chomp($line);
if ($line =~ /^\s*\{"Field/) {
foreach my $key (keys(%dataHash)) {
$line =~ s/\b$key\b/$dataHash{$key}/g;
}
}
print NEWHEADER "$line\n";
}
close (HEADER);
close (NEWHEADER);
copy(NEW_HEADER_FILE, OLD_HEADER_FILE) or
die qq(Unable to replace ") . OLD_HEADER_FILE . qq(" with ") . NEW_HEADER_FILE . qq(");
I could make it more efficient by using map, but that makes it harder to understand.
Basically:
I open three files: the original header, the new header I'm building, and the data file.
I first put my data into a hash where the replacement text is keyed by the original text. (Could have done it the other way around if I wanted.)
I then go through each line of the original header:
If I see a line that looks like it's a field line, I know that I might have to do a replacement.
For each entry in my %dataHash, I substitute $key with the $dataHash{$key} replacement value. I use \b to mark word boundaries; this way, value11 is not substituted just because value1 appears in that string.
Now I write the line back to my new header file. If I didn't replace anything, I just write back the original line.
Once I finish, I copy the new header over the old header file.
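As a quick aside on those \b word boundaries, here is a minimal demonstration (with made-up values) of why they matter:
$ perl -e '$_ = "value1 value11"; s/\bvalue1\b/NEW/g; print'
NEW value11
Without the \b anchors, the value1 key would also rewrite the first part of value11.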
This script should work:
keyval is the file containing the key/value pairs,
filetoreplace is the file containing the data to be modified,
and the file named changed will contain the changes.
#!/bin/sh
echo
keylist=`cat keyval | awk '{ print $1}'`
while read line
do
for i in $keylist
do
if echo $line | grep -wq $i; then
value=`grep -w $i keyval | awk '{print $2}'`
line=`echo $line | sed -e "s/$i/$value/g"`
fi
done
echo $line >> changed
done < filetoreplace
This might be kind of slow if your files are big, since it spawns several grep, awk and sed processes for every line of input.
gawk -F '[ \t]*|"' 'FNR == NR {repl[$1]=$2;next}{for (f=1;f<=NF;++f) for (r in repl) if ($f == r) $f=repl[r]; print} ' keyfile file.h
The first pass (FNR == NR) loads the key/value pairs from keyfile into the repl array; the second pass replaces any field of file.h that exactly matches a key. Note that assigning to a field makes gawk rebuild the record with the default output field separator (a single space), so the original quoting and spacing are not preserved exactly.