multi search and replace with sed or similar commands in a unicode environment - windows

I have some *.txt files placed in c:\apple and its subdirectories, in a Windows 7 environment.
eg:
c:\apple\orange
c:\apple\pears ....etc
but numbers of subfolders in c:\apple are unknown
and I have a text file (say sample.txt) which is something like a config file; its structure is:
綫 綫
胆 胆
湶 湶
峯 峯
There is one space between the Chinese character and its replacement string.
I hope I can use this sample.txt file to search ALL the text files in C:\APPLE\ and its subdirectories, find those Chinese characters, and replace each one with the string that follows it.
I have tried sed but had no luck with the Chinese characters.
sed -r "s/^(.*) (.*)/s#\1#\2#/g" c:\temp\sample.txt *.txt
Anyone have an idea?

Assuming your text files including sample.txt are encoded with UTF-16LE, please try:
perl -e '
use utf8;
use File::Find;
$topdir = "c:/apple";            # top level of subfolders
$mapfile = "c:/temp/sample.txt"; # config file to map character to code
$enc = "utf16le";                # character coding of texts
open(FH, "<:encoding($enc)", $mapfile) or die "$mapfile: $!";
while (<FH>) {
    @_ = split(" ");
    $map{$_[0]} = $_[1];
}
close(FH);
find(\&process, $topdir);
sub process {
    my $file = $_;
    if (-f $file && $file =~ /\.txt$/) {
        my $tmp = "$file.tmp";
        my $lines = "";
        open(FH, "<:encoding($enc)", $file) or die "$file: $!";
        open(W, ">:encoding($enc)", $tmp) or die "$tmp: $!";
        while (<FH>) {
            $lines .= $_;    # slurp all text
        }
        foreach $key (keys %map) {
            $lines =~ s/$key/$map{$key}/ge;
        }
        print W $lines;
        close(FH);
        close(W);
        rename $file, "$file.bak";   # back up the original file
        rename $tmp, $file;
    }
}'
I should mention that I have not tested the code in a Windows execution environment (it was tested on Linux with Windows files). If it has problems, please let me know. You may need to modify the assignments to $topdir, $mapfile, or $enc.
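One caveat if you run the one-liner from cmd.exe on Windows itself (this is my assumption, not something verified on your setup): cmd.exe does not treat single quotes as quoting characters, so the perl -e '...' form above will likely be mangled by the shell. Saving the body of the one-liner (everything between the outer quotes) to a script file, say c:\temp\replace_map.pl (a hypothetical name), and running it directly avoids the quoting problem:
perl c:\temp\replace_map.pl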

Can I find similar named files ignoring case, dashes, spaces or other characters?

EDIT 2:
Let's say I have two directories. One contains:
/dir1/Test File Name.txt
/dir1/This is anotherfile.txt
/dir1/And-Another File.txt
Directory 2 looks like:
/dir2/test-File_Name.txt
/dir2/test file_Name.txt
/dir2/This Is another file.txt
/dir2/And another_file.txt
How can I find (or match) files that are named similarly? In this example, file 1 from dir1 would match files 1 and 2 in dir2, and so on.
I'm trying to do this in bash. Say I have a file named "Test File 1.txt"; I want to find any file that is named similarly, like:
test-file 1.txt
test file 1.txt
Test-file-1.txt
test-file_1.zip
etc etc
I can ignore case with find ./files/ -maxdepth 1 -iname $FILE, but I don't know how to ignore all the other characters.
Is there a way I can do this in bash?
EDIT:
Sorry, I forgot to mention that I need to iterate over all files; the file name is not always the same, I just used an example.
So it could be named "Test File 1.txt", or it could be named something completely different, like "Something Else.txt".
So I want to look for all similarly named files using a complete file name as the base, but that base file name can vary. I hope that makes more sense.
If Perl is an option for you, please try the following:
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
    $f2 = $_;
    s#.*/##;            # remove directory name
#   s#\..*?$##;         # remove extension (wrong)
    s#\.[^.]*$##;       # remove extension (corrected)
    s#[\W_]#[\\W_]?#g;  # replace non-alphanumeric chars
    $pat = $_ . "\\.\\w+\$";
#   print $pat, "\n";   # uncomment to see the regex pattern
    foreach $f1 (@files1) {
        if ($f1 =~ m#/$pat#i) {
            print "$f1 <=> $f2\n";
        }
    }
}'
Output:
dir1/And-Another File.txt <=> dir2/And another_file.txt
dir1/Test File Name.txt <=> dir2/test file_Name.txt
dir1/Test File Name.txt <=> dir2/test-File_Name.txt
dir1/This is anotherfile.txt <=> dir2/This Is another file.txt
[Explanations]
The concept is to generate a regex pattern on the fly from each filename in one directory and match it against the files in the other directory.
The file extension is replaced with a pattern that matches any extension.
Each non-alphanumeric character and underscore is replaced with a pattern that matches such a character or its absence, so that anotherfile and another file match. For example, dir2/test-File_Name.txt generates the pattern test[\W_]?File[\W_]?Name\.\w+$.
The i option appended to the match enables case-insensitive matching.
You can see the generated regex by uncommenting the noted line.
One possible problem is that we cannot generate a pattern which matches another file starting from the filename anotherfile. In other words, the matching is one-directional. A possible workaround is to ignore non-alphanumeric characters and underscores entirely when matching; it may result in unexpected overmatching depending on the words and punctuation. We would need to define the similarity more precisely to go further.
[Edit]
In order to get the result back to bash variables, please try:
while read -r -d "" line; do
    # do something with the bash variable "line"
    echo "$line"
done < <(
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
    $f2 = $_;
    s#.*/##;            # remove directory name
#   s#\..*?$##;         # remove extension (wrong)
    s#\.[^.]*$##;       # remove extension (corrected)
    s#[\W_]#[\\W_]?#g;  # replace non-alphanumeric chars
    $pat = $_ . "\\.\\w+\$";
#   print $pat, "\n";   # uncomment to see the regex pattern
    foreach $f1 (@files1) {
        if ($f1 =~ m#/$pat#i) {
            push(@result, "$f1 <=> $f2");
            # if you want just the list of filenames, comment out the line above
            # and uncomment the line below
            #push(@result, $f1, $f2);
        }
    }
}
print join("\0", @result) . "\0";
')
The results are stored in the bash variable line, one entry at a time.
If you want to tweak the output format, please modify the push(@result, ...) line.
[EDIT]
Modified to work with the following filename pairs:
"Sample Filename.txt" <=> "Sample Filename (100).txt"
"Sample.Filename.txt" <=> "Sample Filename.txt"
Here's the updated code:
while read -r -d "" line; do
    # do something with the bash variable "line"
    echo "$line"
done < <(
perl -e '
@files1 = glob "dir1/*";
@files2 = glob "dir2/*";
foreach (@files2) {
    $f2 = $_;
    s#.*/##;            # remove directory name
    s#\.[^.]*$##;       # remove extension
    s#\s*\(.*?\)##;     # remove parentheses if any
    s#\s*\[.*?\]##;     # remove square brackets if any
    s#[\W_]#[\\W_]?#g;  # replace non-alphanumeric chars
    $pat = $_ . "\\s?((\\(.*?\\))|(\\[.*?\\]))?" . "\\.\\w+\$";
#   print $pat . "\n";  # uncomment to see the regex pattern
    foreach $f1 (@files1) {
        if ($f1 =~ m#/$pat#i) {
            push(@result, "$f1 <=> $f2");
            # if you want just the list of filenames, comment out the line above
            # and uncomment the line below
            #push(@result, $f1, $f2);
        }
    }
}
print join("\0", @result) . "\0";
')

Find if null exists in csv file

I have a csv file. The file has some anomalies as it contains some unknown characters.
The characters appear at line 1535 in popular editors (images attached below). The sed command in the terminal for this line does not show anything.
$ sed '1535!d' sample.csv
"sample_id","sample_column_text_1","sample_"sample_id","sample_column_text_1","sample_column_text_2","sample_column_text_3"
However, below are snapshots of the file in various editors.
Sublime Text
Nano
Vi
The directory has various csv files that contain this character/chain of characters.
I need to write a bash script to determine the files that have such characters. How can I achieve this?
The following is from:
http://www.linuxquestions.org/questions/programming-9/how-to-check-for-null-characters-in-file-509377/
#!/usr/bin/perl -w
use strict;
my $null_found = 0;
foreach my $file (@ARGV) {
    if ( ! open(F, "<$file") ) {
        warn "couldn't open $file for reading: $!\n";
        next;
    }
    while (<F>) {
        if ( /\000/ ) {
            print "detected NULL at line $. in file $file\n";
            $null_found = 1;
            last;
        }
    }
    close(F);
}
exit $null_found;
If it works as desired, you can save it to a file, nullcheck.pl, and make it executable:
chmod +x nullcheck.pl
It seems to take an array of file names as input, but it will exit non-zero if it finds a NULL in any of them, so I'd only pass in one at a time. The command below is used to run the script.
for f in $(find . -type f -exec grep -Iq . {} \; -and -print) ; do perl ./nullcheck.pl $f || echo "$f has nulls"; done
The above find command is lifted from Linux command: How to 'find' only text files?
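If you would rather scan a whole directory tree in a single invocation, here is a minimal, untested sketch along the same lines (restricting it to *.csv files and slurping each file whole are my assumptions; adjust as needed):
perl -e '
use strict;
use warnings;
use File::Find;

find(sub {
    return unless -f $_ && /\.csv$/i;      # only regular *.csv files
    open my $fh, "<:raw", $_ or return;    # read raw bytes, not characters
    local $/;                              # slurp the whole file
    my $data = <$fh>;
    close $fh;
    print "$File::Find::name has nulls\n" if defined $data && $data =~ /\x00/;
}, ".");
'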
You can try grep to find whether the files contain \000 characters:
grep '\000' filename
You can use tr to remove the NULs and make a NUL-free copy of the file:
tr < file-with-nulls -d '\000' > file-without-nulls

Extracting the first two characters from a file in perl into another file

I'm having a little bit of trouble with my code below -- I'm trying to figure out how to open up all these text files (.csv files that end in DIS that all have one line in them) and get the first two characters (these are all numbers) from them and print them into another file of the same name, with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;
foreach my $file (@DIS){
    my $name = $file;
    my $output = "$name.number";
    open(INHANDLE, "< $file") || die("Could not open file");
    while(<INHANDLE>){
        open(OUT_FILE,">$output") || die;
        my $line = $_;
        chomp ($line);
        my $string = $line;
        if ($string eq ""){
            print "0";
        } else {
            print substr($string,0,2);
        }
    }
    system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers that are concatenated together and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!
When I run this code, I get a list of numbers that are concatenated together and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (ie. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to; it won't add newlines for you (there are some global variables which can change this, but don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems with file handling in Perl.
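For instance, a minimal, untested sketch of your loop using lexical filehandles and autodie (leaving Path::Tiny out for brevity) might look like this:
use strict;
use warnings;
use autodie;    # open/close failures now die with a useful message

for my $file (glob '*.DIS') {
    open my $in,  '<', $file;            # lexical filehandles close themselves
    open my $out, '>', "$file.number";   # when they go out of scope
    my $line = <$in>;
    chomp $line if defined $line;
    if (!defined $line || $line eq '') {
        print {$out} "0\n";
    } else {
        print {$out} substr($line, 0, 2), "\n";
    }
}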
I suggest you do it like this
Each *.DIS file is opened and the contents read into $text. Then a regex substitution is used to remove the first three characters from the string and capture the first two in $1.
If the substitution succeeded, the contents of $1 are written to the number file; otherwise the original file was empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.DIS file.
use strict;
use warnings;
use v5.10.1;
use autodie;

for my $dis_file ( glob '*.DIS' ) {

    my $text = do {
        open my $fh, '<', $dis_file;
        <$fh>;
    };

    my $num_file = "$dis_file.number";

    open my $dis_fh, '>', $dis_file;
    open my $num_fh, '>', $num_file;

    if ( defined $text and $text =~ s/^(..).?// ) {
        print $num_fh "$1\n";
        print $dis_fh $text;
    }
    else {
        print $num_fh "0\n";
        print $dis_fh "-\n";
    }
}
This awk script extracts the first two chars of each file into its own file. Empty files are expected to have one empty line, based on the spec.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > FILENAME".number"}' *.DIS
This will remove the first 3 chars
cut -c 4-
A bash for loop will be better for doing both, though we need to modify the awk script a little bit:
for f in *.DIS;
do awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number;
cut -c 4- $f > $f.cut;
done
Explanation: loop through all the files in *.DIS. For the first line of each file, try to get the first two chars (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (the line is empty or has only one char), set the line to 0; otherwise use pre. Then print the line; the output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you can provide the argument explicitly. For example, if the first line of a .DIS file is 12ABCDE, the .number file gets 12 and the .cut file gets BCDE.
Ideally you should quote "$f", since the file name may contain spaces...

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and builds the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of the O(M*N) of your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
    ? do {
        shift;
        my $backup_extension = $1;
        my $backup_name = $backup_extension =~ /\*/
            ? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
            : sub { shift . $backup_extension };
        my $oldargv = '-';
        sub {
            if ( $ARGV ne $oldargv ) {
                rename( $ARGV, $backup_name->($ARGV) );
                open( ARGVOUT, '>', $ARGV );
                select(ARGVOUT);
                $oldargv = $ARGV;
            }
        };
    }
    : sub { };

die "$0: File with replacements required." unless @ARGV;

my ( $re, %replace );
do {
    my $filename = shift;
    open my $fh, '<', $filename;
    %replace = map { chomp; split ',', $_, 2 } <$fh>;
    close $fh;
    $re = join '|', map quotemeta, keys %replace;
    $re = qr/($re)/;
};

while (<>) {
    $in_place->();
    s/$re/$replace{$1}/g;
}
continue {print}
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with a backup name containing a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
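If the in-place plumbing above is more than you need, the core technique (load the pairs once, build a single alternation with quotemeta, substitute in one pass) fits in a few lines. Here is a minimal, untested sketch that reads replace.csv and filters STDIN; it assumes the CSV is non-empty and comma-separated:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# Load "from,to" pairs once and build one alternation regex from the keys.
open my $fh, '<', 'replace.csv';
my %replace = map { chomp; split /,/, $_, 2 } <$fh>;
close $fh;
my $re = join '|', map quotemeta, keys %replace;

while (<STDIN>) {
    s/($re)/$replace{$1}/g;   # one pass over each line
    print;
}
Usage would be along the lines of
./replace_min.pl < long_file_name1.txt > long_file_name1.txt.tmp
where replace_min.pl is just a made-up name for the saved sketch.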
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you want to do.
BTW, sorry, but I do not understand what is wrong with your script; it should work, unless your CSV file has more than 2 columns.

Bash script optimisation

This is the script in question:
for file in `ls products`
do
    echo -n `cat products/$file \
        | grep '<td>.*</td>' | grep -v 'img' | grep -v 'href' | grep -v 'input' \
        | head -1 | sed -e 's/^ *<td>//g' -e 's/<.*//g'`
done
I'm going to run it on 50000+ files, which would take about 12 hours with this script.
The algorithm is as follows:
Find only lines containing table cells (<td>) that do not contain any of 'img', 'href', or 'input'.
Select the first of them, then extract the data between the tags.
The usual bash text filters (sed, grep, awk, etc.) are available, as well as perl.
Looks like that can all be replaced by one gawk command:
gawk '
/<td>.*<\/td>/ && !(/img/ || /href/ || /input/) {
    sub(/^ *<td>/,""); sub(/<.*/,"")
    print
    nextfile
}
' products/*
This uses the gawk extension nextfile.
If the wildcard expansion is too big, then
find products -type f -print | xargs gawk '...'
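If you prefer Perl for this step, roughly the same logic fits in a one-liner (an untested sketch; close ARGV plays the role of gawk's nextfile, moving on to the next input file after the first matching cell):
perl -ne 'if (m{<td>.*</td>} && !/img|href|input/) { s/^ *<td>//; s/<.*//; print; close ARGV }' products/*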
Here's some quick Perl to do the whole thing; it should be a lot faster.
#!/usr/bin/perl

process_files($ARGV[0]);

# process each file in the supplied directory
sub process_files($)
{
    my $dirpath = shift;
    my $dh;
    opendir($dh, $dirpath) or die "Cant readdir $dirpath. $!";
    # get a list of files
    my @files;
    do {
        @files = readdir($dh);
        foreach my $ent ( @files ){
            if ( -f "$dirpath/$ent" ){
                get_first_text_cell("$dirpath/$ent");
            }
        }
    } while ($#files > 0);
    closedir($dh);
}

# return the content of the first html table cell
# that does not contain img,href or input tags
sub get_first_text_cell($)
{
    my $filename = shift;
    my $fh;
    open($fh,"<$filename") or die "Cant open $filename. $!";
    my $found = 0;
    while ( ( my $line = <$fh> ) && ( $found == 0 ) ){
        ## capture html and text inside a table cell
        if ( $line =~ /<td>([&;\d\w\s"'<>]+)<\/td>/i ){
            my $cell = $1;
            ## omit anything with the following tags
            if ( $cell !~ /<(img|href|input)/ ){
                $found++;
                print "$cell\n";
            }
        }
    }
    close($fh);
}
Simply invoke it by passing the directory to be searched as the first argument:
$ perl parse.pl /html/documents/
What about this (should be much faster and clearer):
for file in products/*; do
grep -P -o '(?<=<td>).*(?=<\/td>)' $file | grep -vP -m 1 '(img|input|href)'
done
The for loop iterates over every file in products; note the difference from your ls-based syntax.
The first grep outputs just the text between <td> and </td>, without those tags, for every cell, as long as each cell is on a single line.
Finally, the second grep outputs just the first of those lines that doesn't contain img, href or input (which is what I believe you wanted to achieve with that head -1), and exits right then, reducing the overall time and letting the next file be processed sooner.
I would have loved to use just a single grep, but then the regex would be really awful. :-)
Disclaimer: of course I haven't tested it
