I have a giant CSV file (6 GB) whose rows look like this:
"87687","institute Polytechnic, Brazil"
"342424","university of India, India"
"24343","univefrsity columbia, Bogata, Colombia"
and I would like to remove all punctuation and lowercase the second column, yielding:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
what would be the most efficient way to do this on the terminal?
Tried:
cat TEXTFILE | tr -d '[:punct:]' > OUTFILE
problem: the result is not lowercase, and tr acts on both columns, not just the second.
Here is the robust, reliable way: a real CSV parser in Perl, using just one process.
Since it processes the file line by line, the 6 GB file size should not be an issue.
#!/usr/bin/perl
use strict; use warnings; # harness
use Text::CSV; # load the needed module (install it)
use feature qw/say/; # say = print("...\n")
# create an instance of a new CSV parser
my $csv = Text::CSV->new({ auto_diag => 1 });
# open a File Handle or exit with error
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) { # parse line by line
$_ = $row->[1]; # work on column 2 only
s/[\s[:punct:]]//g; # remove both space(s) and punct(s)
$_ = lc $_; # Lower Case current value $_
$row->[1] = $_; # store the cleaned value back
say join ",", map { qq/"$_"/ } @$row; # re-quote each field and print the whole row
}
close $fh; # close the File Handle
Output
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Install
cpan Text::CSV
Here's an approach using xsv and process substitution:
paste -d, \
<(xsv select 1 infile.csv) \
<(xsv select 2 infile.csv | sed 's/[[:blank:][:punct:]]*//g;s/.*/\L&/')
The sed command first removes all blanks and punctuation, then lowercases the entire match.
This also works when the first field contains blanks and commas, and retains quoting where required.
Using sed
$ sed -E ':a;s/([^,]*,)([^ ,]*)[ ,]([[:alpha:]]+)/\1\L\2\3/;ta' input_file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia
I suggest using this awk solution, which should work with any version of awk:
awk 'BEGIN{FS=OFS="\",\""} {
gsub(/[^[:alnum:]"]+/, "", $2); $2 = tolower($2)} 1' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Details:
We set "," as both the input and output field separator in the BEGIN block
gsub(/[^[:alnum:]"]+/, "", $2): Strip all non-alphanumeric characters except "
$2 = tolower($2): Lowercase second column
One GNU awk (for gensub()) idea:
awk '
BEGIN { FS=OFS="\"" }
{ $4=gensub(/[^[:alnum:]]/,"","g",tolower($4)) }
1'
This generates:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Another sed approach -
sed -E 's/ +//g; s/([^"]),/\1/g; s/"([^"]*)"/"\L\1"/g' file
I don't like how that leaves no flexibility, though: it makes you rewrite the logic if you find something else you want to remove.
Another in awk -
awk -F'[", ]+' '
{ printf "\"%s\",\"", $2;
for(c=3;c<=NF;c++) printf "%s", tolower($c);
print "\"";
}' file
This approach lets you define and add any additional offending characters into the field delimiters without editing your logic.
$: pat=$"[\"',_;:!@#\$%)(* -]+"
$: echo "$pat"
["',_;:!##$%)(* -]+
$: cat file
"87687","institute 'Polytechnic, Brazil"
"342424","university; of-India, India"
"24343","univefrsity )columbia, Bogata, Colombia"
$: awk -F"$pat" '{printf "\"%s\",\"", $2; for(c=3;c<=NF;c++) printf "%s", tolower($c); print "\"" }' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
(I hate the way that lone single quote throws the markup color/format parsing off, lol)
Another way, using Ruby. I edited the data to show that only the second field is modified.
% ruby -r 'csv' -e 'f = open("file");
CSV.parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Using FastCSV gives a huge speedup
gem install fastcsv
% ruby -r 'fastcsv' -e 'f = open("file");
FastCSV.raw_parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Data
% cat file
"8768, 7","institute Polytechnic, Brazil"
"342 424","university of India, India"
"243 43","univefrsity columbia, Bogata, Colombia"
With your shown samples and attempts, please try the following GNU awk code, which uses its match function. The regex (^"[^"]*",")([^"]*)(".*)$ creates 3 capturing groups whose values are stored in arr; the program then fetches those values to meet the OP's requirement.
awk '
match($0,/(^"[^"]*",")([^"]*)(".*)$/,arr){
gsub(/[^[:alnum:]]+/,"",arr[2])
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
This might work for you (GNU sed):
sed -E 's/("[^"]*",)/\1\n/;h;s/.*\n//;s/[[:punct:] ]//g;s/.*/"\L&"/;H;g;s/\n.*\n//' file
Divide and rule.
Partition the line into two fields, make a copy, process the second field (removing punctuation and spaces, lowercasing and re-quoting), and then re-assemble the fields.
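The same one-liner spread over multiple lines, with comments (GNU sed; the logic is unchanged):
sed -E '
# append a newline after the first field
s/("[^"]*",)/\1\n/
# save both halves to the hold space
h
# keep only the second half in the pattern space
s/.*\n//
# strip punctuation and spaces
s/[[:punct:] ]//g
# lowercase and re-quote
s/.*/"\L&"/
# append the result to the hold space, then fetch it all back
H
g
# delete the original second half between the two newlines
s/\n.*\n//
' file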
An alternative, perhaps?
sed -E ':a;s/^("[^"]*",".*)[^[:alpha:]"](.*)/\L\1\2/;ta' file
Here is a way to do so in PHP.
Note: PHP will not output double quotes unless the first column needs them. The second column will never need double quotes: after cleaning, it has no spaces or special characters.
$max_line_length = 100;
if (($fp = fopen("file.csv", "r")) !== FALSE) {
while (($data = fgetcsv($fp, $max_line_length, ",")) !== FALSE) {
$data[1] = strtolower(preg_replace('/[\s[:punct:]]/', '', $data[1]));
fputcsv(STDOUT, $data, ',', '"');
}
fclose($fp);
}
Lately I have been using Unicode more often, and I wondered if there is a command-line tool to convert Unicode between its forms.
Would be nice to be able to say:
uni_convert "☃" --string
and learn that the character is defined in Unicode as "SNOWMAN".
Perl's Unicode-Tussle distribution comes with the useful uniprops.
$ uniprops '☃'
U+2603 ‹☃› \N{SNOWMAN}
...
$ uniprops 'U+2603'
U+2603 ‹☃› \N{SNOWMAN}
...
$ uniprops 'SNOWMAN'
U+2603 ‹☃› \N{SNOWMAN}
...
If you're writing code, you'll want charnames.
Want     Have     Code
----     ----     ----
$code    $char    ord($char)
$code    $name    charnames::vianame($name)
$char    $code    chr($code)
$char    $name    chr(charnames::vianame($name))
$name    $code    charnames::viacode($code)
$name    $char    charnames::viacode(ord($char))
vianame accepts official aliases (e.g. LF for LINEFEED). You'll need to parse U+ notation yourself if you wish to accept it. ($code = hex(s/^U\+//r);)
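A quick round trip from the shell, using just two of the calls from the table (a minimal sketch; assumes Perl 5.10+):
perl -Mcharnames=:full -E 'say charnames::viacode(0x2603); printf "U+%X\n", charnames::vianame("SNOWMAN")'
This prints SNOWMAN and U+2603.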
Example:
use strict;
use warnings;
use feature qw( say );
use experimental qw( regex_sets ); # Safe. Optional since 5.36.
use utf8; # Source encoded using UTF-8.
use open ":std", ":encoding(UTF-8)"; # Terminal provides/expects UTF-8.
use charnames qw( :full );
use Encode qw( decode_utf8 );
@ARGV == 1
or die("usage\n");
my $s = decode_utf8($ARGV[0]);
for my $cp ( unpack "W*", $s ) {
my $ch = chr($cp);
if ( $ch =~ /(?[ \p{Print} - \p{Mark} ])/ ) { # Not sure if good enough.
printf "‹%s› ", $ch;
} else {
print "--- ";
}
printf "U+%X ", $cp;
say charnames::viacode($cp);
}
$ uni_id ☃
‹☃› U+2603 SNOWMAN
$ uni_id çà
‹ç› U+E7 LATIN SMALL LETTER C WITH CEDILLA
‹à› U+E0 LATIN SMALL LETTER A WITH GRAVE
Other resources:
Unicode::UCD
Provides access to the information found in the Unicode Character Database.
The Unicode Standard is more than characters and properties.
perluniprops
unichars from Unicode-Tussle (e.g. unichars '\p{Hiragana}')
Here is an awk approach.
Download this file from unicode.org, which provides the latest names.
Then:
q=$(printf '%X\n' \'☃)
awk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
}
END{ print names[q] }
' q="$q" names.txt
Prints:
SNOWMAN
If you want to go the other way:
cp=$(awk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
other_names[str]=$1
}
END{ print other_names[q] }
' q="SNOWMAN" names.txt)
echo -e "\u${cp}"
Prints:
☃
If you have GNU awk, you can convert the hex index to a number and print from within awk. This allows a single source file to go either way, by defining q or r:
gawk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' r='SNOWMAN' names.txt
☃
gawk '/^[[:xdigit:]]+/{
str=$0
sub(/^[[:xdigit:]]+[[:blank:]]+/,"",str)
names[$1]=str
other_names[str]=$1
}
END{ print q ? names[q] : sprintf("%c", strtonum("0x" other_names[r])) }
' q=$(printf '%X\n' \'☃) names.txt
SNOWMAN
I separated the code into a file and created a repo:
https://github.com/poti1/uni_convert
There is a Capture the Flag challenge.
I have two files; one contains scrambled text like this, with about 550 entries:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries:
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters of a word from the first file and then check whether a word from the second file has the same length. If so, I sort that too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
var=0
ro=$(echo $p | perl -F -lane 'print sort @F')
len_ro=${#ro}
while IFS="" read -r o || [ -n "$o" ]
do
ro2=$(echo $o | perl -F -lane 'print sort @F')
len_ro2=${#ro2}
let "var+=1"
if [ $len_ro == $len_ro2 ]; then
if [ $ro == $ro2 ]; then
echo $o >> new.txt
echo $var >> whichline.txt
fi
fi
done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns may produce the same sum (for example, "ad" and "bc" both sum to 197).
[edit]
For the records:
- no anagrams contained in dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to the CTF for the guy who wanted the files: https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );
my $dict_qfn = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";
sub key { join "", sort split //, $_[0] }
my %dict;
{
open(my $fh, "<", $dict_qfn)
or die("Can't open \"$dict_qfn\": $!\n");
while (<$fh>) {
chomp;
push @{ $dict{key($_)} }, $_;
}
}
{
open(my $fh, "<", $scrambled_qfn)
or die("Can't open \"$scrambled_qfn\": $!\n");
while (<$fh>) {
chomp;
my $matches = $dict{key($_)};
say "$_ matches #$matches" if $matches;
}
}
I wouldn't be surprised if this takes only one millionth of the time of your solution for the sizes you provided (and it scales much better than yours if you were to increase the sizes).
I would do something like this with gawk
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
print dict[csort()]
}
function csort( chars, sorted, n, i) {
split($0, chars, "")
n = asort(chars)
for (i = 1; i <= n; i++)   # iterate in index order; for-in order is unspecified
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt
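One caveat: for a scrambled word with no dictionary match, print dict[csort()] emits an empty line (and the lookup itself creates an empty entry in dict). A guarded variant, if that matters:
gawk '
NR == FNR {
dict[csort()] = $0
next
}
{
key = csort()
if (key in dict)
print dict[key]
}
function csort( chars, sorted, n, i) {
split($0, chars, "")
n = asort(chars)
for (i = 1; i <= n; i++)
sorted = sorted chars[i]
return sorted
}' dictionary.txt scrambled-words.txt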
Here's a perl-free solution I came up with using sort and join:
sort_letters() {
# Splits each letter onto a line, sorts the letters, then joins them
# e.g. "hello" becomes "ehllo"
echo "${1}" | fold-b1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
# Convert each line to [sorted] [original]
# then sort and save the results with a .sorted extension
while read -r original; do
sorted=$(sort_letters "${original}")
echo "${sorted} ${original}"
done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
I tried something very similar, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>scrambled-words_sorted.txt
exec 3>&-
exec 3<dictionary.txt
while read -r line <&3; do
printf "%s" ${line} | perl -F -lane 'print sort #F'
done>dictionary_sorted.txt
exec 3>&-
printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
counter="$((++counter))"
grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
printf "\n" >>whichline.txt
done
exec 3>&-
As you can see I don't create a new.txt file; instead I only create whichline.txt, with a blank line where the word doesn't match. You can easily paste them together to create new.txt.
The logic behind the script is nearly the same as yours, except that I call perl fewer times and save two support files.
I think (but I am not sure) that creating them and cycling through only one file will be better than ~5 million calls of perl. This way perl is called "only" ~10k times.
Finally, I decided to use grep because it is (maybe) the fastest regex matcher, and since it matches the entire line, the length check is intrinsic to the regex.
Please note that what @benjamin-w said is still valid; in that case grep will reply badly, and I did not handle it!
I hope this could help [:
I have this simple flat file (file.txt):
a43
test1
abc
cvb
bnm
test2
test1
def
ijk
xyz
test2
kfo
I need all lines between test1 and test2, in two forms. The first form creates two new files like:
newfile1.txt:
test1
abc
cvb
bnm
test2
newfile2.txt:
test1
def
ijk
xyz
test2
and the second form creates only one new file like:
newfile.txt
test1abccvbbnmtest2
test1defijkxyztest2
Do you have any suggestions?
EDIT
For the second form, I used this:
sed -n '/test1/,/test2/p' file.txt > newfile.txt
But it gives me a result like:
test1abccvbbnmtest2test1defijkxyztest2
I need a line break, like:
test1abccvbbnmtest2
test1defijkxyztest2
You can use this awk:
awk -v fn="newfile.txt" '/test1/ {
f="newfile" ++n ".txt";   # a new block starts: pick the next group file name
s=1                       # turn printing on
} s {
print > f;                # write the line to the current group file
printf "%s", $0 > fn      # append it, without a newline, to the summary file
} /test2/ {
close(f);                 # the block ends: close the group file
print "" > fn;            # terminate the summary line
s=0                       # turn printing off
} END {
close(fn)
}' file
Perl, like sed and other languages, has the ability to select ranges of lines from a file, so it's a good fit for what you're trying to do.
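For instance, the flip-flop (range) operator alone reproduces the sed command from the question (a minimal sketch):
perl -ne 'print if /test1/ .. /test2/' file.txt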
This solution ended up being a lot more complicated than I thought it would be. I see no good reason to use it over @anubhava's awk solution. But I wrote it, so here it is:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use constant {
RANGE_START => qr/\Atest1\z/,
RANGE_END => qr/\Atest2\z/,
SUMMARY_FILE => 'newfile.txt',
GROUP_FILE => 'newfile%d.txt'
};
my $n = 1; # starting number of group file
my @wg; # storage for "working group" of lines
# Open summary file to write to.
open(my $sfh, '>', SUMMARY_FILE) or die $!;
while (my $line = <>) {
chomp $line;
# If the line is within the range, add it to our working group.
push @wg, $line if $line =~ RANGE_START .. $line =~ RANGE_END;
if ($line =~ RANGE_END) {
# We are at the end of a group, so summarize it and write it out.
unless (@wg > 2) {
# Discard any partial or empty groups.
@wg = ();
next;
}
# Write a line to the summary file.
$sfh->say(join '', @wg);
# Write out all lines to the group file.
my $group_file = sprintf(GROUP_FILE, $n);
open(my $gfh, '>', $group_file) or die $!;
$gfh->say(join "\n", @wg);
close($gfh);
printf STDERR "WROTE %s with %d lines\n", $group_file, scalar #wg;
# Get ready for the next group.
$n++;
@wg = ();
}
}
close($sfh);
printf STDERR "WROTE %s with %d groups\n", SUMMARY_FILE, $n - 1;
To use it, write the above lines into a file named e.g. ranges.pl, and make it executable with chmod +x ranges.pl. Then:
$ ./ranges.pl plat.txt
WROTE newfile1.txt with 5 lines
WROTE newfile2.txt with 5 lines
WROTE newfile.txt with 2 groups
$ cat newfile1.txt
test1
abc
cvb
bnm
test2
$ cat newfile.txt
test1abccvbbnmtest2
test1defijkxyztest2
For the second form you can add a newline after "test2" by appending \n:
sed -n '/test1/,/test2/p' file.txt | sed -e 's/test2/test2\n/g' > newfile.txt
sed is not well suited to creating multiple files, so for the first form you should find another solution.
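If GNU csplit is available, a rough sketch of the first form might look like this (it splits at each test1, so the first piece holds the leading junk and the last piece keeps anything after the final test2, which you would still need to trim):
csplit -z -f newfile -b '%d.txt' file.txt '/test1/' '{*}'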
I have files that are named like C1_1_B_(1)IMG1511.jpg and I want to split them up into a list, so that I get back:
C1
1
B
(1)
IMG1511.jpg
I am trying to figure out whether I need to do this with sed, awk, or a regex, and what that would look like. I could do it in AppleScript, but I would rather call a shell command, as it is much faster.
EDIT
OK, so now it has changed a bit and I can't figure out how to fix it.
example are
"P24-M_(1)Lighter_Ray_Logo_Full_Color.jpg"
"P24_(1)24x36loren.jpg"
so _(*) indicates where I want to stop splitting, so I end up with:
P24
M
(1)
Lighter_Ray_Logo_Full_Color.jpg
and
P24
(1)
24x36loren.jpg
Translate _ to newlines:
echo "C1_1_B_(1)IMG1511.jpg" | tr '_' '\n'
Output:
C1
1
B
(1)IMG1511.jpg
Although, it looks like you want to split on ) as well. No can do with tr, but...
echo "C1_1_B_(1)IMG1511.jpg" | tr '_' '\n' | sed -e 's/)/)\
/'
There's a linefeed inside the replacement string, which is needed for Mac. On other *nix OS's, a simple escape works:
echo "C1_1_B_(1)IMG1511.jpg" | tr '_' '\n' | sed -e 's/)/)\n/'
Output:
C1
1
B
(1)
IMG1511.jpg
Would this do?
<<<"C1_1_B_(1)IMG1511.jpg" sed -r 'y/_/\n/;s/\([^)]*\)/&\n/g;'
I know it's not sed/awk, but here's something that would work in perl:
#!/usr/bin/perl
while(<STDIN>) {
my($line) = $_;
chomp($line);
my @values = split(/_|(\(\d+\))/, $line);
foreach my $val (@values) {
if ( $val !~ m/^$/)
{
print "$val\n";
}
}
}
exit 0;
If the filename is stored in $P, the following works with zsh:
myarr=(${(s/_/)$(echo $P | sed 's/)/)_/g')})
This creates an actual array.
This handles filenames which contain _ ( ) in other places.
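A usage sketch (zsh; assuming P holds the filename from the original question):
P='C1_1_B_(1)IMG1511.jpg'
myarr=(${(s/_/)$(echo $P | sed 's/)/)_/g')})
print -l $myarr
which prints C1, 1, B, (1), IMG1511.jpg on separate lines.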
<<< '
C1_1_B_(1)IMG151).jpg
C1_1_B_(1)IMG_(4444).jpg
C(22)2_1_22_333_B_(144)I_M_G_(_1511).jpg
' sed -nr '# isolate, process and print first section
s/^([^(]+)_/\1\n/;h
s/(.*)\n.*/\1/
s/([^_]+)_/\1\n/gp;x
# process the second section
s/.*\n(.*)/\1/
s/([^)]+\))/\1\n/p
';exit
str="C1_1_B_(1)IMG1511.jpg"
ary=( $(IFS=_; echo $str) )
for ((idx=0; idx < ${#ary[@]}; idx++)); do echo ${ary[$idx]}; done
outputs
C1
1
B
(1)IMG1511.jpg