substitute characters but not the last one - ruby

I have a string for example like this:
str = 'TEST;NAME=1;TARGET_SOMETHING;PLATFORM_INTEL;'
Now I would like to substitute all ";" with "-D" and delete the last ";"
I'm doing it with:
str.gsub(/;/, ' -D').gsub(/^/, ' -D')
The second gsub is only there to add the -D to the beginning of the line as well.
Result:
-DTEST -DNAME=1 -DTARGET_SOMETHING -DPLATFORM_INTEL -D
How to tell Ruby not to output the last "-D" or to delete the last ";" in str?
Any suggestions to do it in the same line?

You can combine split and map for this.
irb(main):012:0> str.split(";").map {|i| "-D#{i}"}.join(" ")
=> "-DTEST -DNAME=1 -DTARGET_SOMETHING -DPLATFORM_INTEL"
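The trailing "-D" disappears here because String#split discards trailing empty fields by default, so the final ";" produces no extra element. A minimal sketch wrapping the one-liner in a helper; the name to_defines is made up for illustration and is not from the question:

```ruby
# Hypothetical helper (the name is illustrative): turn "A;B;" into "-DA -DB".
def to_defines(str)
  # split(';') drops the empty field after the final ';',
  # so no dangling "-D" is produced.
  str.split(';').map { |flag| "-D#{flag}" }.join(' ')
end

puts to_defines('TEST;NAME=1;TARGET_SOMETHING;PLATFORM_INTEL;')
```

This stays a single expression, so it can still be written on one line as the question asks.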

elements= (str.gsub(/;/, ' -D').gsub(/^/, ' -D')).split(' ')
output will be:
["-DTEST", "-DNAME=1", "-DTARGET_SOMETHING", "-DPLATFORM_INTEL", "-D"]
then delete last element from an array:
elements.delete_at(elements.size-1)
output will be in elements variable
p elements
["-DTEST", "-DNAME=1", "-DTARGET_SOMETHING", "-DPLATFORM_INTEL"]

lowercase and remove punctuation from a csv

I have a giant file (6gb) which is a csv and the rows look like so:
"87687","institute Polytechnic, Brazil"
"342424","university of India, India"
"24343","univefrsity columbia, Bogata, Colombia"
and I would like to remove all punctuation and lower the case of second column yielding:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
what would be the most efficient way to do this on the terminal?
Tried:
cat TEXTFILE | tr -d '[:punct:]' > OUTFILE
Problem: the result is not lowercased, and tr acts on both columns, not just the second.
With a real CSV parser in Perl, the robust/reliable way, using just one process.
Since the file is processed line by line, its 6 GB size should not be an issue.
#!/usr/bin/perl
use strict; use warnings; # harness
use Text::CSV; # load the needed module (install it)
use feature qw/say/; # say = print("...\n")
# create an instance of a new CSV parser
my $csv = Text::CSV->new({ auto_diag => 1 });
# open a File Handle or exit with error
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) { # parse line by line
$_ = $row->[1]; # parse only column 2
s/[\s[:punct:]]//g; # removes both space(s) and punct(s)
$_ = lc $_; # Lower Case current value $_
$row->[1] = qq/"$_"/; # edit changes and (re)"quote"
say join ",", @$row; # display the whole current row
}
close $fh; # close the File Handle
Output
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
install
cpan Text::CSV
Here's an approach using xsv and process substitution:
paste -d, \
<(xsv select 1 infile.csv) \
<(xsv select 2 infile.csv | sed 's/[[:blank:][:punct:]]*//g;s/.*/\L&/')
The sed command first removes all blanks and punctuation, then lowercases the entire match.
This also works when the first field contains blanks and commas, and retains quoting where required.
Using sed
$ sed -E ':a;s/([^,]*,)([^ ,]*)[ ,]([[:alpha:]]+)/\1\L\2\3/;ta' input_file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia
I suggest using this awk solution, which should work with any version of awk:
awk 'BEGIN{FS=OFS="\",\""} {
gsub(/[^[:alnum:]"]+/, "", $2); $2 = tolower($2)} 1' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Details:
We make "," input and output field separators in BEGIN block
gsub(/[^[:alnum:]"]+/, "", $2): Strip all non-alphanumeric characters except "
$2 = tolower($2): Lowercase second column
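The same three steps (split on ",", strip non-alphanumerics from the second field, lowercase it) can be sketched in Ruby. This assumes, as the awk version does, that every line consists of exactly two double-quoted fields; the method name normalize_second_column is hypothetical:

```ruby
# Sketch only: assumes each line looks like "first","second".
def normalize_second_column(line)
  first, second = line.chomp.split('","', 2)
  return line.chomp if second.nil? # line does not have the expected shape

  # Remove everything that is not a letter or digit (this also drops the
  # closing quote, which is added back below), then lowercase.
  cleaned = second.gsub(/[^[:alnum:]]+/, '').downcase
  %(#{first}","#{cleaned}")
end

puts normalize_second_column('"87687","institute Polytechnic, Brazil"')
```

For a 6 GB file this would be applied line by line, for example with File.foreach, so the whole file never has to sit in memory.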
One GNU awk (for gensub()) idea:
awk '
BEGIN { FS=OFS="\"" }
{ $4=gensub(/[^[:alnum:]]/,"","g",tolower($4)) }
1'
This generates:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Another sed approach -
sed -E 's/ +//g; s/([^"]),/\1/g; s/"([^"]*)"/"\L\1"/g' file
I don't like how that leaves no flexibility, and makes you rewrite the logic if you find something else you want to remove, though.
Another in awk -
awk -F'[", ]+' '
{ printf "\"%s\",\"", $2;
for(c=3;c<=NF;c++) printf "%s", tolower($c);
print "\"";
}' file
This approach lets you define and add any additional offending characters into the field delimiters without editing your logic.
$: pat=$"[\"',_;:!@#\$%)(* -]+"
$: echo "$pat"
["',_;:!@#$%)(* -]+
$: cat file
"87687","institute 'Polytechnic, Brazil"
"342424","university; of-India, India"
"24343","univefrsity )columbia, Bogata, Colombia"
$: awk -F"$pat" '{printf "\"%s\",\"", $2; for(c=3;c<=NF;c++) printf "%s", tolower($c); print "\"" }' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
(I hate the way that lone single quote throws the markup color/format parsing off, lol)
Another way using ruby. Edited the data to show only the second field is modified.
% ruby -r 'csv' -e 'f = open("file");
CSV.parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Using FastCSV gives a huge speedup
gem install fastcsv
% ruby -r 'fastcsv' -e 'f = open("file");
FastCSV.raw_parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Data
% cat file
"8768, 7","institute Polytechnic, Brazil"
"342 424","university of India, India"
"243 43","univefrsity columbia, Bogata, Colombia"
With your shown samples and attempts, please try the following GNU awk code, which uses its match function. The regex (^"[^"]*",")([^"]*)(".*)$ creates 3 capturing groups whose values are stored in the array arr; the program then reads them back to build the required output.
awk '
match($0,/(^"[^"]*",")([^"]*)(".*)$/,arr){
gsub(/[^[:alnum:]]+/,"",arr[2])
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
This might work for you (GNU sed):
sed -E s'/("[^"]*",)/\1\n/;h;s/.*\n//;s/[[:punct:] ]//g;s/.*/"\L&"/;H;g;s/\n.*\n//' file
Divide and rule.
Partition the line into two fields, make a copy, process the second field removing punctuation and spaces, re-quote and lowercase and then re-assemble the fields
An alternative, perhaps?
sed -E ':a;s/^("[^"]*",".*)[^[:alpha:]"](.*)/\L\1\2/;ta' file
Here is a way to do so in PHP.
Note: PHP will not output double quotes unless needed by the first column. The second column will never need double quotes, it has no space or special characters.
$max_line_length = 100;
if (($fp = fopen("file.csv", "r")) !== FALSE) {
while (($data = fgetcsv($fp, $max_line_length, ",")) !== FALSE) {
$data[1] = strtolower(preg_replace('/[\s[:punct:]]/', '', $data[1]));
fputcsv(STDOUT, $data, ',', '"');
}
fclose($fp);
}

How should I use sed to delete specific lines while keeping longer lines that contain them?

I generated a list of files, and it has 17417 lines like:
./usr
./usr/share
./usr/share/mime-info
./usr/share/mime-info/libreoffice7.0.mime
./usr/share/mime-info/libreoffice7.0.keys
./usr/share/appdata
./usr/share/appdata/libreoffice7.0-writer.appdata.xml
./usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
./usr/share/appdata/libreoffice7.0-draw.appdata.xml
./usr/share/appdata/libreoffice7.0-impress.appdata.xml
./usr/share/appdata/libreoffice7.0-base.appdata.xml
./usr/share/appdata/libreoffice7.0-calc.appdata.xml
./usr/share/applications
./usr/share/applications/libreoffice7.0-xsltfilter.desktop
./usr/share/applications/libreoffice7.0-writer.desktop
./usr/share/applications/libreoffice7.0-base.desktop
./usr/share/applications/libreoffice7.0-math.desktop
./usr/share/applications/libreoffice7.0-startcenter.desktop
./usr/share/applications/libreoffice7.0-calc.desktop
./usr/share/applications/libreoffice7.0-draw.desktop
./usr/share/applications/libreoffice7.0-impress.desktop
./usr/share/icons
./usr/share/icons/gnome
./usr/share/icons/gnome/16x16
./usr/share/icons/gnome/16x16/mimetypes
./usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
The thing is, I want to delete the lines like:
./usr
./usr/share
./usr/share/mime-info
./usr/share/appdata
./usr/share/applications
./usr/share/icons
./usr/share/icons/gnome
./usr/share/icons/gnome/16x16
./usr/share/icons/gnome/16x16/mimetypes
and also the "." at the start, so that the result looks like:
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
Is this possible using sed, or is it more practical to use another tool?
With your list in the filename list, you could do:
sed -n 's/^[.]//;/\/.*[._].*$/p' list
Where:
sed -n suppresses printing of pattern-space; then
s/^[.]// is the substitution form that simply removes the first character '.' from each line; then
/\/.*[._].*$/p matches lines that contain a '.' or '_' somewhere after a '/', with p causing those lines to be printed. This relies on the directory names in the list containing neither '.' nor '_'.
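For comparison, the same two steps (strip the leading dot, keep only lines with a '.' or '_' after a '/') can be sketched in Ruby. The method name files_only is hypothetical, and the same caveat applies: a directory whose name contains '.' or '_' would be kept too.

```ruby
# Sketch of the sed logic: remove the leading "." and keep only lines
# that contain "." or "_" somewhere after a "/".
def files_only(lines)
  lines.map { |line| line.chomp.sub(/\A\./, '') }
       .select { |path| path.match?(%r{/.*[._]}) }
end

sample = [
  './usr',
  './usr/share/mime-info',
  './usr/share/mime-info/libreoffice7.0.mime'
]
puts files_only(sample)
```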
Example Use/Output
$ sed -n 's/^[.]//;/\/.*[._].*$/p' list
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
Note: if your sed does not allow chaining expressions with ';', you would need:
sed -n -e 's/^[.]//' -e '/\/.*[._].*$/p' list
Assuming you want to delete the lines that are contained in other
pathnames (i.e. the parent directories), would you please try:
sort -r list.txt | awk ' # sort the list in the reverse order
{
sub("^\\.", "") # remove leading dot
s = prev; sub("/[^/]+$", "", s) # remove the rightmost slash and following characters
if (s != $0) print # if s != $0, it means $0 is not a substring of the previous line
prev = $0 # keep $0 for the next line
}'
Result:
/usr/share/mime-info/libreoffice7.0.mime
/usr/share/mime-info/libreoffice7.0.keys
/usr/share/icons/gnome/16x16/mimetypes/libreoffice7.0-oasis-formula.png
/usr/share/applications/libreoffice7.0-xsltfilter.desktop
/usr/share/applications/libreoffice7.0-writer.desktop
/usr/share/applications/libreoffice7.0-startcenter.desktop
/usr/share/applications/libreoffice7.0-math.desktop
/usr/share/applications/libreoffice7.0-impress.desktop
/usr/share/applications/libreoffice7.0-draw.desktop
/usr/share/applications/libreoffice7.0-calc.desktop
/usr/share/applications/libreoffice7.0-base.desktop
/usr/share/appdata/org.libreoffice7.0.kde.metainfo.xml
/usr/share/appdata/libreoffice7.0-writer.appdata.xml
/usr/share/appdata/libreoffice7.0-impress.appdata.xml
/usr/share/appdata/libreoffice7.0-draw.appdata.xml
/usr/share/appdata/libreoffice7.0-calc.appdata.xml
/usr/share/appdata/libreoffice7.0-base.appdata.xml
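The idea of dropping every path that is a parent of another listed path can also be sketched in Ruby without sorting, which keeps the original order of the list. This is only an illustration under that interpretation of the question; the method name leaf_paths is hypothetical.

```ruby
require 'set'

# Sketch: strip the leading "." and drop every path that is an ancestor
# directory of some other path in the list.
def leaf_paths(paths)
  cleaned = paths.map { |path| path.chomp.sub(/\A\./, '') }

  # Collect every ancestor directory of every path.
  parents = Set.new
  cleaned.each do |path|
    dir = File.dirname(path)
    until dir == '/' || dir == '.'
      break unless parents.add?(dir) # already seen, so its ancestors are too
      dir = File.dirname(dir)
    end
  end

  cleaned.reject { |path| parents.include?(path) }
end

puts leaf_paths(['./usr', './usr/share', './usr/share/appdata',
                 './usr/share/appdata/libreoffice7.0-calc.appdata.xml'])
```

Unlike the pattern-based sed approach, this does not depend on file names containing a '.' or '_'.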

remove only *some* fullstops from a csv file

If I have lines like the following:
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
how can I replace all instances of ,., with ,?,
I want to preserve actual decimal places in the numbers so I can't just do
sed 's/./?/g' file
however when doing:
sed 's/,.,/,?,/g' file
this only appears to work in some cases, i.e. there are still instances of ,., hanging around.
Does anyone have any pointers?
Thanks
This should work :
sed ':a;s/,\.,/,?,/g;ta' file
With successive ,., strings, after one substitution succeeds the next character to be processed is the following ., which doesn't match the pattern because its leading comma was already consumed, so you need a second pass.
:a is a label for upcoming loop
,\., matches a dot between commas. Note that the dot must be escaped because an unescaped . matches any character (so ,., would also match ,a,).
g makes the substitution global, i.e. applied to every match on the line
ta tests previous substitution and if it succeeded, loops to :a label for remaining substitutions.
With sed this requires a loop, as shown in the answer above; however, the problem is easily solved on the perl command line using lookarounds:
perl -pe 's/(?<=,)\.(?=,)/?/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This command doesn't need a loop because instead of matching surrounding commas we're just asserting their position using a lookbehind and lookahead.
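The difference between consuming the commas and merely asserting them can be reproduced in Ruby, whose regex engine also supports lookbehind and lookahead. A small sketch; the method names are made up for illustration:

```ruby
# Consumes both commas, so adjacent ",.," matches overlap and every
# second one is skipped in a single pass.
def naive_replace(row)
  row.gsub(',.,', ',?,')
end

# Only asserts the commas, so each "." is matched independently.
def lookaround_replace(row)
  row.gsub(/(?<=,)\.(?=,)/, '?')
end

row = '1,C,T,.,.,.,1.293'
puts naive_replace(row)      # 1,C,T,?,.,?,1.293
puts lookaround_replace(row) # 1,C,T,?,?,?,1.293
```

The decimal point in 1.293 is untouched in both cases because it is not surrounded by commas.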
All that's necessary is a single substitution
$ perl -pe 's/,\.(?=,)/,?/g' dots.csv
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
You have an example using sed style regular expressions. I'll offer an alternative - parse the CSV, and then treat each thing as a 'field':
#!/usr/bin/perl
use strict;
use warnings;
#iterate input row by row
while ( <DATA> ) {
#remove linefeeds
chomp;
#split this row on ,
my @row = split /,/;
#iterate each field
foreach my $field ( @row ) {
#replace this field with "?" if it's "."
$field = "?" if $field eq ".";
}
#stick this row together again.
print join( ",", @row ), "\n";
}
__DATA__
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This is more verbose than it needs to be, to illustrate the concept. This could be reduced down to:
perl -F, -lane 'print join ",", map { $_ eq "." ? "?" : $_ } @F'
If your CSV also has quoting, then you can break out the Text::CSV module, which handles that neatly.
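The field-by-field idea carries over to other languages as well. A minimal Ruby sketch, assuming (like the Perl version) that no field is quoted or contains a comma; the method name replace_dot_fields is hypothetical:

```ruby
# Split on commas, replace fields that are exactly ".", and rejoin.
def replace_dot_fields(line)
  # A limit of -1 keeps trailing empty fields, so the number of
  # columns is preserved even if the line ends with commas.
  line.chomp.split(',', -1).map { |field| field == '.' ? '?' : field }.join(',')
end

puts replace_dot_fields('1,987372,C,T,.,.,1.293,.')
```

Because each field is compared as a whole, values such as 1.293 can never be affected.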
You just need 2 passes since the trailing , found on a ,., match isn't available to match the leading , on the next ,.,:
$ sed 's/,\.,/,?,/g; s/,\.,/,?,/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
The above will work in any sed on any OS.

Replacing escape quotes with just quotes in a string

So I'm having an issue replacing \" in a string.
My Objective:
Given a string, if there's an escaped quote in the string, replace it with just a quote
So for example:
"hello\"74" would be "hello"74"
simp"\"sons would be simp"sons
jump98" would be jump98"
I'm currently trying this: but obviously that doesn't work and messes everything up, any assistance would be awesome
str.replace "\\"", "\""
I think you are mistaken about how \ works. You can never define a string as
a = "hello"74"
Also, the escape character is used only while defining the string literal; it is not part of the value. E.g.:
a = "hello\"74"
# => "hello\"74"
puts a
# hello"74
However, in case my assumption above is incorrect, the following example should help you:
a = 'hello\"74'
# => "hello\\\"74"
puts a
# hello\"74
a.gsub!("\\","")
# => "hello\"74"
puts a
# hello"74
EDIT
The above gsub will replace all instances of \; however, OP only needs to replace \" with ". The following should do the trick:
a.gsub!("\\\"","\"")
# => "hello\"74"
puts a
# hello"74
You can use gsub:
word = 'simp"\"sons';
print word.gsub(/\\"/, '"');
#=> simp""sons
I'm currently trying str.replace "\\"", "\"" but obviously that doesn't work and messes everything up, any assistance would be awesome
str.replace "\\"", "\"" doesn't work for two reasons:
It's the wrong method. String#replace replaces the entire string, you are looking for String#gsub.
"\\"" is incorrect: " starts the string, \\ is a backslash (correctly escaped) and " ends the string. The last " starts a new string.
You have to either escape the double quote:
puts "\\\"" #=> \"
Or use single quotes:
puts '\\"' #=> \"
Example:
content = <<-'EOF'
"hello\"74"
simp"\"sons
jump98"
EOF
puts content.gsub('\\"', '"')
Output:
"hello"74"
simp""sons
jump98"
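To make the single-quote versus double-quote distinction concrete, here is a short sketch. The helper name unescape_quotes is made up for illustration; the lengths in the comments follow from how Ruby parses each literal.

```ruby
# Replace every backslash-quote pair with a plain double quote.
def unescape_quotes(str)
  str.gsub('\\"', '"')
end

with_backslash    = 'hello\"74' # single quotes keep the backslash: 9 characters
without_backslash = "hello\"74" # double quotes: \" is just ": 8 characters

puts with_backslash.length
puts without_backslash.length
puts unescape_quotes(with_backslash)
```

A string that was written with double quotes never contained the backslash in the first place, so there is nothing to replace in it.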

Bash array + sed + html

I need to change the prices in an HTML file. I search for them and store them in an array, but then I have to change them and save /nuevo-focus.html
price=( `cat /home/delkav/info-sitioweb/html/productos/autos/nuevo-focus.html | grep -oiE '([$][0-9.]{1,7})'|tr '\n' ' '` )
price2=( $90.880 $0 $920 $925 $930 $910 $800 $712 $27.220 $962 )
sub (){
for item in "${price[@]}"; do
for x in ${price2[@]}; do
sed s/$item/$x/g > /home/delkav/info-sitioweb/html/productos/autos/nuevo-focus.html
done
done
}
sub
The output of "cat /home/.../nuevo-focus.html|grep -oiE '([$][0-9.]{1,7})'|tr '\n' ' '" is:
$86.880 $0 $912 $908 $902 $897 $882 $812 $25.725 $715
In bash, $0 is the name of the script being run and $1 through $9 refer to its command line arguments. In the line:
price2=( $90.880 $0 $920 $925 $930 $910 $800 $712 $27.220 $962 )
Because the values are unquoted, each $N is expanded first: $920, for example, becomes the value of $9 (an empty string if no ninth argument was given) followed by 20.
Try doing this instead:
price2=( '$90.880' '$0' '$920' '$925' '$930' '$910' '$800' '$712' '$27.220' '$962' )
EDIT for part two of question
If what you are trying to do with the sed line is replace the prices in the file, overwriting the old ones, then you should do this:
sed -i "s/$item/$x/g" /home/delkav/info-sitioweb/html/productos/autos/nuevo-focus.html
This will perform the substitution in place (-i), modifying the input file.
EDIT for part three of the question
I just realized that your nested loop does not really make sense. I am assuming that what you want to do is replace each price from price with the corresponding price in price2.
If that is the case, then you should use a single loop, looping over the indices of the array:
for i in ${!price[*]}
do
sed -i "s/${price[$i]}/${price2[$i]}/g" /home/delkav/info-sitioweb/html/productos/autos/nuevo-focus.html
done
I'm not able to test that right now, but I think it should accomplish what you want.
To explain it a bit:
${!price[*]} gives you all of the indices of your array (e.g. 0 1 2 3 4 ...)
For each index we then replace the corresponding old price with the new one. There is no need for a nested loop as you have done. When you execute that, what you are basically doing is this:
replace every occurrence of "foo" with "bar"
# at this point, there are now no more occurrences of "foo" in your file
# so all of the other replacements do nothing
replace every occurrence of "foo" with "baz"
replace every occurrence of "foo" with "spam"
replace every occurrence of "foo" with "eggs"
replace every occurrence of "foo" with "qux"
replace every occurrence of "foo" with "whatever"
etc...
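The pairing of old and new prices by index can also be sketched outside of bash. The following Ruby version is only an illustration (the method name replace_prices and the sample HTML are made up); it does all replacements in a single pass, so a new price that happens to equal another old price is not replaced a second time, which can happen when the substitutions run one after another.

```ruby
# Sketch: replace each old price with the new price at the same index.
def replace_prices(html, old_prices, new_prices)
  mapping = old_prices.zip(new_prices).to_h
  # Regexp.union escapes the strings, so "$" and "." are matched literally.
  # Longest first, so a short price cannot match the start of a longer one.
  pattern = Regexp.union(old_prices.sort_by { |price| -price.length })
  html.gsub(pattern) { |match| mapping[match] }
end

html = '<td>$86.880</td><td>$912</td><td>$908</td>'
puts replace_prices(html, ['$86.880', '$912', '$908'], ['$90.880', '$920', '$925'])
```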
