Delete strings with non-Ukrainian characters in bash

Using file structure
foo_11: "Марія"
foo_112: "Superman"
FOOTLONG: "Subway"
foo_13: "Юлія"
I want to remove all strings that don't have at least one character from Ukrainian alphabet.
Script:
for i in *.txt;
do
sed '/[^А-ЯЄЇІа-яєїі]+/d' $i >$i.out
mv $i.out $i
done
doesn't do anything. What is wrong?
Using mac bash.

Assuming that your character class defining Ukrainian letters is correct, the following should work:
sed '/[А-ЯЄЇІа-яєїі]/!d' file
[А-ЯЄЇІа-яєїі] matches a Ukrainian letter anywhere on the line.
Note that even the letters that look like the ASCII letters A, I, a, i are actually Ukrainian (Cyrillic) letters with Unicode codepoints U+0410, U+0406, U+0430, U+0456.
! negates the match, meaning that only lines not containing at least 1 Ukrainian letter match.
d deletes those lines.
To put it all together:
for f in *.txt; do
sed -i '' '/[А-ЯЄЇІа-яєїі]/!d' "$f" # -i '' is BSD Sed syntax; GNU sed takes just -i
done
As for what you've tried:
As @StefanHegny points out in a comment on the question, + isn't supported unless sed is run with -E to enable extended regular expressions; without -E, the cumbersome \{1,\} must be used. (\+ is supported only by GNU sed, not by the BSD version of sed that macOS comes with.)
However, even the fixed version of your command, sed '/[^А-ЯЄЇІа-яєїі]\{1,\}/d', doesn't do what you want: it deletes all lines that contain at least one non-Ukrainian-letter character, which eliminates all of your input lines, given that they all have ASCII-based field names and contain :.
You should double-quote variable references such as $i to protect them from shell expansions: "$i"
BSD Sed does support in-place updating with -i, but - unlike GNU Sed - it requires that an empty option-argument (indicating that no backup of the input file should be made) be specified as a separate argument: -i ''.
Your write-to-a-temp-file-first-then-replace-the-original approach works too, but it's generally better to use the following idiom: sed ... file > file.tmp && mv file.tmp file. Separating the mv command with && ensures that the original file is only replaced if the sed command succeeded.
That said, that doesn't help with logic errors as in the case at hand: despite outputting nothing, sed reports success in this case.
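The temp-file idiom can be extended to catch that logic-error case as well. A minimal sketch, assuming a hypothetical helper name keep_matching_lines (not part of the original answer), a pattern that contains no / characters, and that an empty result always indicates a mistake:

```shell
# Hypothetical helper: keep only the lines of FILE that match PATTERN.
# The original is replaced only if sed succeeded AND produced output.
keep_matching_lines() {
  local pattern="$1" file="$2" tmp
  tmp=$(mktemp "${file}.XXXXXX") || return 1
  if sed "/$pattern/!d" "$file" > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"   # leave the original untouched
    return 1
  fi
}
```

Usage would be something like keep_matching_lines '[А-ЯЄЇІа-яєїі]' "$f" inside the loop. Note that mktemp creates the temporary file with restrictive permissions, so the replaced file may end up with a different mode than the original.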

This code would achieve what you want (if I understood your question correctly):
grep -i "Я\|Є\|Ї\|І" /folder/file >> /tmp/result
The result is stored on /tmp/result
Note: I don't know Ukrainian, so I'm sure I did not include all Ukrainian characters; please add or delete the Ukrainian characters you want to match in the construction above.
Note2: this code is case insensitive thanks to grep -i so you only need to add the character once (lowercase or capital).
To put it on your loop it could be:
for i in *.txt;
do
grep -i "Я\|Є\|Ї\|І" "$i" > "$i".out
mv "$i".out "$i"
done
Edit: I edited this answer to make it simpler, and to add a loop to it.

Related

Bash script : in file replace characters with 'X" between two given strings using sed [duplicate]

file=mylog.log
search_str="&Name="
end_str="&"
sed -i -E ':a; s/('"$search_str"'X*)[^X'"$end_str"']/\1X/; ta' "$file"
Ex 1:
something&Name=JASON&else
to
something&Name=XXXXX&else
And actually, my current sed command works fine when instead of a '"$end_str"' if I use '&' character... Like this :
sed -i -E ':a; s/('"$search_str"'X*)[^X&]/\1X/; ta' "$file"
So, to summarize: when the end delimiter inside the bracket expression (after ^X) is a single character, my sed command works fine. But the same command does not work if I use a string instead of a single character.
For example, my sed command won't work in this case :
end_str="\%26"
sed -i -E ':a; s/('"$search_str"'X*)[^X'"$end_str"']/\1X/; ta' "$file"
Eg:
something&Name=JASON_MATTHEW_DONALD%26else
TO
something&Name=XXXXXXXXXXXXXXXXXXXX%26else
Eg:2
something&Name=JASON%26else
TO
something&Name=XXXXX%26else
Please let me know
Place your string variable outside the character class and capture it to check for further substitutions:
sed -i -E ':a; s/('"$search_str"'X*)[^X](.*'"$end_str"')/\1X\2/; ta' "$file"
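To see how the loop behaves without touching a file, here is a small sketch that applies the same substitution to standard input. The function name mask_between is hypothetical, and both markers are assumed to contain no characters that are special in extended regular expressions or that clash with the / delimiter:

```shell
# Hypothetical wrapper around the looped substitution shown above.
# $1 = start marker, $2 = end marker (both inserted into the ERE unescaped).
mask_between() {
  # Each pass turns the first not-yet-masked character after the start
  # marker into X; "ta" jumps back to label "a" while a substitution happened.
  sed -E -e ':a' -e "s/($1X*)[^X](.*$2)/\1X\2/" -e 'ta'
}

printf '%s\n' 'something&Name=JASON%26else' | mask_between '&Name=' '%26'
```

Splitting the script into separate -e expressions avoids relying on ; after a label, which not every sed accepts. Because .* is greedy, a line that contains the end marker more than once is masked up to its last occurrence.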
If the number of X doesn't matter you can simplify it to:
search="Name"
sed "s/$search=[^&]*/$search=XXX/" input.file
This assumes that $search won't contain special characters which have a meaning in sed's regex syntax. If special characters can be a problem you need to prepare the $search variable, as explained here: Is it possible to escape regex metacharacters reliably with sed
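One way to prepare the variable is sketched below. The helper names escape_bre and mask_value are hypothetical, the escaping targets basic regular expressions with / as the s delimiter, and the search string is assumed to be a single line:

```shell
# Hypothetical helper: backslash-escape the characters that are special in
# a basic regular expression, plus the / delimiter used by the s command.
escape_bre() {
  printf '%s\n' "$1" | sed 's:[][\\/.^$*]:\\&:g'
}

# Replace the value following KEY= (up to the next &) with XXX, reading stdin.
# The key is captured and re-emitted via \1 so it never has to be escaped
# for the replacement side.
mask_value() (
  key=$(escape_bre "$1")
  sed "s/\(${key}=\)[^&]*/\1XXX/"
)

printf '%s\n' 'a&Na.me=JASON&b' | mask_value 'Na.me'
```

With the escaping in place, the dot in Na.me matches only a literal dot, so a line containing NaXme=... is left alone.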
I insist on the point that keeping the length of passwords in logs is a very bad idea (security wise).
Having said that:
First, a character list [...] is not the right tool to match a string. For that we need an alternation (...|...).
end_str="(&|%26)"
But it is quite difficult to express a "not a string" in regex.
not_end_str="([^&]|[^%]|%[^2]|%2[^6])"
Using all that we may build a pure bash solution (maybe not fast, but works).
It prints to stdout to show how it works.
Redirect to a file to store the result.
file=mylog.log
search_str="&Name="
end_str="(&|%26)" # write end_str as an alternate value.
not_end_str="([^&]|[^%]|%[^2]|%2[^6])" # regex negate end_string.
# Build a regex that split each part of the text.
myreg="(.*${search_str}X*)(${not_end_str}*)([&%].*)"
while IFS= read -r line; do
[[ $line =~ $myreg ]]
len=$((${#BASH_REMATCH[@]}-2)) # do not count [0] and last.
arr=("${BASH_REMATCH[@]:1}") # remove [0] and last.
arr[1]=${arr[1]//?/X} # Replace name with "X"'s.
arr[2]='' # Clear not_end_str match
printf '%s' "${arr[@]}"; echo # Print modified line.
done <"$file"
Further reading:
Regular expression to match line that doesn't contain a word?
Regular expression that doesn't contain certain string

Remove characters in all text files in a directory using sed

I have a lot of text files that are email templates. Many of them, for some reason, have the following line:
Best Regards,œ
That strange character at the end is what I am interested in removing from all of these files with a single command.
I tried:
for f in *
do
sed 's/"Best Regards,œ"/"Best Regards,"/g' $f | tee $f.t && mv $f.t $f
done
This ran through the process but did not actually remove the 'œ' character.
Please let me know what I am doing incorrectly so I can remove this character and maybe other non-alphanumeric characters using regex [:alnum:], perhaps.
I fixed the issue with removing the unwanted character with:
for f in *
do
sed 's/Best\ Regards\,\œ/Best\ Regards\,/g' $f | tee $f.t && mv $f.t $f
done
However, this still does not remove all of the non-alphanumeric characters from each line of each file. The other things I have tried either do not execute or remove the entire line.
I appreciate your help.
If ① you don't want to have to worry about Unicode, UTF-anything, LANG, etc, and ② you are confident that lines that start with the words "Best Regards," and ONLY those lines are the ones you want to affect, you can simply do this:
sed -i .bak '/^Best Regards,.*/s//Best Regards,/' *
Note that this processes all files in the current directory. If you want to do this in subdirectories, you could use find, with all its goodness. For example:
find /path/to/start/ -exec \
sed -i .bak '/^Best Regards,.*/s//Best Regards,/' {} \;
or if your shell is bash, you could use globstar:
shopt -s globstar
for f in **/*; do
sed -i .bak '/^Best Regards,.*/s//Best Regards,/' "$f"
done
Rather than using tee and mv, these solutions use sed's built-in "in-place" option and create a .bak file as a result. Consult the documentation for your implementation of sed to learn more about how to use the -i option -- it works a little differently with different seds.
This approach eliminates the need to search for that character in particular, so you won't need to worry about how it's being represented. Beware though, it will also eliminate any other text that follows the search string on the same line.
You don't need the loop. You can pass the results of the glob expression directly to sed and use the -i option for in place editing of files:
sed -i.bak 's/Best Regards,œ/Best Regards,/' *
-i.bak changes the input file in place and creates a backup file with the extension .bak.
Some implementations of sed, for example GNU sed, even support -i without an argument; others allow an empty string as the argument for -i. In that case sed will not keep any backup files and simply changes the original file.
With GNU sed:
sed -i 's/Best Regards,œ/Best Regards,/' *
# OR (BSD, MacOS)
sed -i '' 's/Best Regards,œ/Best Regards,/' *
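Neither answer covers the second part of the question, removing other unwanted characters as well. A rough sketch, assuming the templates are meant to be plain ASCII (so any accented or other non-ASCII text would be lost too); the function name strip_non_ascii is hypothetical:

```shell
# Hypothetical filter: delete every byte that is not a tab, newline,
# carriage return, or printable ASCII character (octal 40 to 176).
# LC_ALL=C makes tr work on single bytes, so a multi-byte character
# such as the stray one after "Best Regards," is removed completely.
strip_non_ascii() {
  LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# \305\223 is the UTF-8 encoding of the stray character from the question.
printf 'Best Regards,\305\223\n' | strip_non_ascii
```

Since tr only reads standard input, applying this to a file means writing to a temporary file first and moving it over the original afterwards.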

Removing duplicate entries from files on the basis of substring postfixes

Let's say that I have the following text in a file:
foo.bar.baz
bar.baz
123.foo.bar.baz
pqr.abc.def
xyz.abc.def
abc.def.ghi.jkl
def.ghi.jkl
How would I remove duplicates from the file, on the basis of postfixes? The expected output without duplicates would be:
bar.baz
pqr.abc.def
xyz.abc.def
def.ghi.jkl
(Consider foo.bar.baz and bar.baz. The latter is a substring postfix of the former, so only bar.baz remains. However, neither pqr.abc.def nor xyz.abc.def is a substring postfix of the other, so both remain.)
Try this:
#!/bin/bash
INPUT_FILE="$1"
in="$(cat "$INPUT_FILE")"
out="$in"
for line in $in; do
out=$(echo "$out" | grep -v "\.$line\$")
done
echo "$out"
You need to save it to a script (e.g. bashor.sh), make it executable (chmod +x bashor.sh) and call it with your input file as the first argument:
./bashor.sh path/to/input.txt
Use sed to escape the string for regular expressions, prefix ., postfix $ and pipe this into GNU grep (-f - doesn't work with BSD grep, eg. on a mac).
sed 's/[^-A-Za-z0-9_]/\\&/g; s/^/\\./; s/$/$/' test.txt |grep -vf - test.txt
I just used the regular expression escaping from another answer and didn't think about whether it is reasonable. At first sight it seems fine, but it escapes too much, though this is probably not an issue.
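For systems where grep -f - is not available, the same filtering can be done without regular expressions at all. A sketch using a quadratic loop in awk, which should be acceptable for small files; the function name dedupe_postfixes is hypothetical:

```shell
# Hypothetical helper: print only the lines of FILE that do not end in
# "." followed by another complete line of the same file.
dedupe_postfixes() {
  awk '
    { lines[NR] = $0 }
    END {
      for (i = 1; i <= NR; i++) {
        keep = 1
        n = length(lines[i])
        for (j = 1; j <= NR; j++) {
          suffix = "." lines[j]
          m = length(suffix)
          # Plain string comparison of the last m characters, so dots
          # and other regex metacharacters need no escaping.
          if (m <= n && substr(lines[i], n - m + 1) == suffix) {
            keep = 0
            break
          }
        }
        if (keep) print lines[i]
      }
    }' "$1"
}
```

Called as dedupe_postfixes test.txt, this keeps the original line order.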

Insert line after match using sed

For some reason I can't seem to find a straightforward answer to this and I'm on a bit of a time crunch at the moment. How would I go about inserting a choice line of text after the first line matching a specific string using the sed command. I have ...
CLIENTSCRIPT="foo"
CLIENTFILE="bar"
And I want to insert a line after the CLIENTSCRIPT= line resulting in ...
CLIENTSCRIPT="foo"
CLIENTSCRIPT2="hello"
CLIENTFILE="bar"
Try doing this using GNU sed:
sed '/CLIENTSCRIPT="foo"/a CLIENTSCRIPT2="hello"' file
if you want to substitute in-place, use
sed -i '/CLIENTSCRIPT="foo"/a CLIENTSCRIPT2="hello"' file
Output
CLIENTSCRIPT="foo"
CLIENTSCRIPT2="hello"
CLIENTFILE="bar"
Doc
see sed doc and search \a (append)
Note the standard sed syntax (as in POSIX, so supported by all conforming sed implementations around (GNU, OS/X, BSD, Solaris...)):
sed '/CLIENTSCRIPT=/a\
CLIENTSCRIPT2="hello"' file
Or on one line:
sed -e '/CLIENTSCRIPT=/a\' -e 'CLIENTSCRIPT2="hello"' file
(-e expressions (and the contents of -f files) are joined with newlines to make up the script that sed interprets.)
The -i option for in-place editing is also a GNU extension, some other implementations (like FreeBSD's) support -i '' for that.
Alternatively, for portability, you can use perl instead:
perl -pi -e '$_ .= qq(CLIENTSCRIPT2="hello"\n) if /CLIENTSCRIPT=/' file
Or you could use ed or ex:
printf '%s\n' /CLIENTSCRIPT=/a 'CLIENTSCRIPT2="hello"' . w q | ex -s file
Sed command that works on MacOS (at least, OS 10) and Unix alike (ie. doesn't require gnu sed like Gilles' (currently accepted) one does):
sed -e '/CLIENTSCRIPT="foo"/a\'$'\n''CLIENTSCRIPT2="hello"' file
This works in bash and maybe other shells too that know the $'\n' evaluation quote style. Everything can be on one line and work in older/POSIX sed commands. If there might be multiple lines matching the CLIENTSCRIPT="foo" (or your equivalent) and you wish to only add the extra line the first time, you can rework it as follows:
sed -e '/^ *CLIENTSCRIPT="foo"/b ins' -e b -e ':ins' -e 'a\'$'\n''CLIENTSCRIPT2="hello"' -e ': done' -e 'n;b done' file
(this creates a loop after the line insertion code that just cycles through the rest of the file, never getting back to the first sed command again).
You might notice I added a '^ *' to the matching pattern in case that line shows up in a comment, say, or is indented. It's not 100% perfect but covers some other situations likely to be common. Adjust as required...
These two solutions also get round the problem (for the generic solution to adding a line) that if your new inserted line contains unescaped backslashes or ampersands they will be interpreted by sed and likely not come out the same, just like the \n is - eg. \0 would be the first line matched. Especially handy if you're adding a line that comes from a variable where you'd otherwise have to escape everything first using ${var//} before, or another sed statement etc.
This solution is a little less messy in scripts (that quoting and \n is not easy to read though), when you don't want to put the replacement text for the a command at the start of a line if, say, you are in a function with indented lines. I've taken advantage of the fact that $'\n' is evaluated to a newline by the shell, which is not the case for a regular single-quoted '\n'.
It's getting long enough though that I think perl or even awk might win due to being more readable.
A POSIX compliant one using the s command:
sed '/CLIENTSCRIPT="foo"/s/.*/&\
CLIENTSCRIPT2="hello"/' file
Maybe a bit late to post an answer for this, but I found some of the above solutions a bit cumbersome.
I tried simple string replacement in sed and it worked:
sed 's/CLIENTSCRIPT="foo"/&\nCLIENTSCRIPT2="hello"/' file
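The & sign stands for the matched string; you then add \n and the new line. Note that \n in the replacement is interpreted as a newline by GNU sed but not by the BSD sed that ships with macOS.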
As mentioned, if you want to do it in-place:
sed -i 's/CLIENTSCRIPT="foo"/&\nCLIENTSCRIPT2="hello"/' file
Another thing. You can match using an expression:
sed -i 's/CLIENTSCRIPT=.*/&\nCLIENTSCRIPT2="hello"/' file
Hope this helps someone
The awk variant :
awk '1;/CLIENTSCRIPT=/{print "CLIENTSCRIPT2=\"hello\""}' file
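The awk variant above adds the new line after every matching line. If only the first match should get it, as the question asks, a flag can be used. A sketch with a hypothetical function name insert_after_first; it treats the search text as a plain string rather than a regex, and assumes neither argument contains backslashes (awk -v would interpret them as escape sequences):

```shell
# Hypothetical helper: print FILE, adding NEW_LINE once, directly after
# the first line that contains SEARCH_TEXT.
# usage: insert_after_first SEARCH_TEXT NEW_LINE FILE
insert_after_first() {
  awk -v pat="$1" -v new="$2" '
    { print }
    !done && index($0, pat) { print new; done = 1 }
  ' "$3"
}
```

For example: insert_after_first 'CLIENTSCRIPT="foo"' 'CLIENTSCRIPT2="hello"' file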
I had a similar task, and was not able to get the above perl solution to work.
Here is my solution:
perl -i -pe "BEGIN{undef $/;} s/^\[mysqld\]$/[mysqld]\n\ncollation-server = utf8_unicode_ci\n/sgm" /etc/mysql/my.cnf
Explanation:
Uses a regular expression to search for a line in my /etc/mysql/my.cnf file that contained only [mysqld] and replaced it with
[mysqld]
collation-server = utf8_unicode_ci
effectively adding the collation-server = utf8_unicode_ci line after the line containing [mysqld].
I had to do this recently as well, for both Mac and Linux, and after browsing many posts and trying many things I never got what I wanted: a solution that is simple to understand, uses well-known standard commands with simple patterns, is a one-liner, is portable, and can be extended with more constraints. Then I looked at it from a different perspective and realized I could do without the one-liner requirement if a two-liner met the rest of my criteria. In the end I came up with this solution, which works on both Ubuntu and Mac:
insertLine=$(( $(grep -n "foo" sample.txt | cut -f1 -d: | head -1) + 1 ))
sed -i -e "$insertLine"' i\'$'\n''bar'$'\n' sample.txt
In first command, grep looks for line numbers containing "foo", cut/head selects 1st occurrence, and the arithmetic op increments that first occurrence line number by 1 since I want to insert after the occurrence.
In second command, it's an in-place file edit, "i" for inserting: an ansi-c quoting new line, "bar", then another new line. The result is adding a new line containing "bar" after the "foo" line. Each of these 2 commands can be expanded to more complex operations and matching.

How to append to specific lines in a flat file using shell script

I have a flat file that contains something like this:
11|30646|654387|020751520
11|23861|876521|018277154
11|30645|765418|016658304
Using shell script, I would like to append a string to certain lines in this file, if those lines contain a specific string.
For example, in the above file, for lines containing 23861, I would like to append a string "Processed" at the end, so that the file becomes:
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
I could use sed to append the string to all lines in the file, but how do I do it for specific lines ?
I'd do it this way
sed '/|23861|/{s/$/|Something/;}' file
This is similar to Marcelo's answer but doesn't require extended expressions and is, I think, a little cleaner.
First, match lines having 23861 between pipes. In a basic regular expression an unescaped | is a literal pipe; GNU sed treats \| as alternation, so the pipes must not be backslash-escaped.
/|23861|/
Then, on those lines, replace the end-of-line with the string |Something
{s/$/|Something/;}
If you want to do more than one of these you could simply list them
sed '/|23861|/{s/$/|Something/;};/|30645|/{s/$/|SomethingElse/;}' file
Use the following awk-script:
$ awk '/23861/ { $0=$0 "|Processed" } {print}' input
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
or, using sed:
$ sed 's/\(.*23861.*$\)/\1|Processed/' input
11|30646|654387|020751520
11|23861|876521|018277154|Processed
11|30645|765418|016658304
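Both commands above match 23861 anywhere on the line. If the value should only count when it is the second field, awk can compare that field directly. A sketch, with a hypothetical function name mark_processed:

```shell
# Hypothetical helper: append "|Processed" to the lines of FILE whose
# second pipe-separated field is exactly ID.
# usage: mark_processed ID FILE
mark_processed() {
  # Appending "" forces a string comparison, so 023861 does not equal 23861.
  awk -F'|' -v id="$1" '($2 "") == (id "") { $0 = $0 "|Processed" } { print }' "$2"
}
```

For example: mark_processed 23861 input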
Use the substitution command:
sed -i~ -E 's/(\|23861\|.*)/\1|Processed/' flat.file
(Note: the -i~ performs the substitution in-place. Just leave it out if you don't want to modify the original file.)
You can use the shell
while read -r line
do
case "$line" in
*23861*) line="$line|Processed";;
esac
echo "$line"
done < file > tempo && mv tempo file
sed is just a stream version of ed, which has a similar command set but was designed to edit files in place (allegedly interactively, but you wouldn't want to use it that way unless all you had was one of these). Something like
field_2_value=23861
appended_text='|processed'
line_match_regex="^[^|]*|$field_2_value|"
ed "$file" <<EOF
g/$line_match_regex/s/$/$appended_text/
wq
EOF
should get you there.
Note that the $ in .../s/$/... is not expanded by the shell, unlike $line_match_regex and $appended_text, because there's no such thing as $/; instead it's passed through as-is to ed, which interprets it as the text to substitute ($ being regex-speak for "end of line").
The syntax to do the same job in sed, should you ever want to do this to a stream rather than a file in place, is very similar except that you don't need the leading g before the regex address:
sed -e "/$line_match_regex/s/$/$appended_text/" "$input_file" >"$output_file"
You need to be sure that the values you put in field_2_value and appended_text never contain slashes, because ed's g and s commands use those for delimiters.
If they might do, and you're using bash or some other shell that allows ${name//search/replace} parameter expansion syntax, you could fix them up on the fly by substituting \/ for every / during expansion of those variables. Because bash also uses / as a substitution delimiter and also uses \ as a character escape, this ends up looking horrible:
appended_text='|n/a'
ed "$file" <<EOF
g/${line_match_regex//\//\\/}/s/$/${appended_text//\//\\/}/
wq
EOF
but it does work. Note that both ed and sed require a trailing / after the replacement text in s/search/replace/ while bash's ${name//search/replace} syntax doesn't.
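For shells that lack the ${name//search/replace} syntax, the same slash fix-up can be delegated to sed itself. A sketch for the stream (sed) form, with hypothetical helper names escape_slashes and append_where; the values are assumed to be single lines, and the appended text must still not contain & or backslashes:

```shell
# Hypothetical helper: put a backslash in front of every / in a string.
# Using : as the s delimiter keeps the expression readable.
escape_slashes() {
  printf '%s\n' "$1" | sed 's:/:\\/:g'
}

# Append TEXT to every stdin line that matches REGEX.
# usage: append_where REGEX TEXT < input
append_where() (
  regex=$(escape_slashes "$1")
  text=$(escape_slashes "$2")
  # \$ keeps the shell from touching the end-of-line anchor.
  sed -e "/$regex/s/\$/$text/"
)

printf '%s\n' '11|23861|876521' | append_where '^[^|]*|23861|' '|n/a'
```

The function bodies run in a subshell (parentheses instead of braces) so the temporary variables do not leak into the calling script.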

Resources