I'm parsing a file that may contain control-characters (ASCII 0-31). Now I want to replace each of those control-characters with their ASCII-code in hexadecimal representation. A rather simple example of what I have in mind:
$ echo -e "a\011b" | sed -e 's/\o11/\\x09/g'
a\x09b
This converts the tab (\011) to \x09, so the a<tab>b becomes a\x09b.
Obviously I could use 32 -e-parameters, but I consider that bad. Is there a generic approach to this?
BTW, it's not a problem if the \n remains a \n. sed isn't required.
I would use Perl. Note that tab is actually 9, not 8 - if you're trying to change the value, then this is incorrect, but if you're just encoding, this should do the trick:
echo -e "a\011b" | perl -lpe 's/[\0-\037\177]/sprintf "\\x%02x", ord $&/ge'
Given:
itemName='boo\boo\1\7\064.txt'
I want to convert the octals to printables while removing unprintables. The catch: I don't want to remove backslashed alphas like the \b. The result should be:
newItemName='boo\boo4.txt'
I can't figure out why part of the sed statement doesn't work correctly:
newItemName="$(printf "%s" "$itemName" | sed -E 's/(\\[0-7]{1,3})/'"$(somevar="&";printf "${somevar:1}";)"'/g' | tr -dc '[:print:]')"
I used somevar="&"; instead of directly accessing & so I could use variable manipulation.
The search statement s/(\\[0-7]{1,3})/ works fine.
In the printf if I use $somevar or ${somevar:0} instead of ${somevar:1} I get the original string as expected (e.g. \064).
What doesn't work is the ${somevar:1}.
These also don't work: ${somevar/\\/} or ${somevar//\\/}.
What am I misunderstanding about how variable manipulation works?
Is there an easier way to do this? I've searched and searched...
Sam; long time no see! The problem here is the order of evaluation. All of the shell expressions, including the $(somevar="&";printf "${somevar:1}";), are evaluated before sed is even launched. As a result, somevar isn't the string matched by the regex, it's just a literal ampersand. That means ${somevar:1} is just the empty string, and you wind up just running sed -E 's/(\\[0-7]{1,3})//g'.
You need a way to take the matched string and run a calculation on it (after it's been matched), and sed just isn't flexible enough to do this. But perl is. perl has an s operator, similar to sed's, but with the e option the replacement is executed as a perl expression rather than just a literal string. Give this a try:
newItemName="$(printf "%s\n" "$itemName" | perl -pe 's/\\([0-7]{1,3})/chr oct $1/eg' | tr -dc '[:print:]')"
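To see the order of evaluation for yourself, you can build the sed script into a variable and print it before sed ever runs. A rough sketch: the variable name script is just for illustration, and I've used the POSIX form ${somevar#?} (strip the first character) in place of the bash-only ${somevar:1} so it also runs under plain sh.

```shell
# The shell expands the command substitution while building the string,
# long before sed sees anything. somevar holds a literal "&" here.
script='s/(\\[0-7]{1,3})/'"$(somevar="&"; printf '%s' "${somevar#?}")"'/g'

# Show what sed would actually receive as its script.
printf '%s\n' "$script"
```

The replacement part comes out empty, which is why the original command only deletes the escapes.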
What am I misunderstanding about how variable manipulation works?
I believe you are misunderstanding how sed works.
When & character is used inside the replacement string, it is replaced by the whole string matched. See this sed introduction.
Now about ${var:offset} parameter expansion:
somevar="&"
printf "$somevar"
would print &. Then:
printf "${somevar:1}"
would extract the substring starting at offset 1 to the end of the string. The first character is at offset 0, so at offset 1 there is no character, because our variable somevar has only one character. So it will print nothing.
printf "${somevar:0}"
would print a substring starting at offset 0 to the end of the string. So the whole string. So ${somevar:0} is equal to $somevar. It will print &.
So:
$(somevar="&";printf "${somevar:1}";)
expands to nothing, because ${somevar:1} expands to nothing. So your sed command looks like this:
sed -E 's/(\\[0-7]{1,3})//g'
The sed command replaces every \ followed by one to three digits in the range 0-7 with nothing. That removes the octal escapes, but note that it also removes \064 instead of converting it to 4.
Now if it would be ${somevar:0} then:
$(somevar="&";printf "${somevar:0}";)
expands to &, so your sed command would look like this:
sed -E 's/(\\[0-7]{1,3})/&/g'
so it would substitute each match of \\[0-7]{1,3} with itself, i.e. it does nothing.
You could lose the -E option and the (...) group, and just use POSIX-compatible sed:
sed 's/\\[0-7]\{1,3\}//g'
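A quick way to convince yourself of what & does is to wrap it in brackets and compare with an empty replacement. This is just an illustrative sketch; the sample string and the variable names kept and dropped are made up.

```shell
# "&" in the replacement stands for the whole matched text.
kept=$(printf '%s\n' 'a\064b' | sed 's/\\[0-7]\{1,3\}/[&]/g')

# An empty replacement simply deletes each match.
dropped=$(printf '%s\n' 'a\064b' | sed 's/\\[0-7]\{1,3\}//g')

printf '%s\n%s\n' "$kept" "$dropped"
```

The first command should give a[\064]b and the second ab.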
Is there an easier way to do this?
Your method looks fine. You could use a here string instead of printf and you could strengthen the sed to match octal numbers better, depending on needs:
newItemName="$(
<<<"$itemName" sed 's/\\\([0-3][0-7]\{0,2\}\|[0-7]\{1,2\}\)//g' |
tr -dc '[:print:]'
)"
I'm trying to replace a special character with sed. The character is Þ and I want to replace it with ;
The lines of the file are, for example:
0370ÞA020Þ4000011600ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011601ÞRED USADOÞ0,00Þ20190414
0370ÞA020Þ4000011602ÞRED USADOÞ0,00Þ20190414
Thanks!
Edit
It worked and is solved.
Thanks!!!
Try this - a simple substitution works for me
sed 's/Þ/;/g'
That's the job tr was created to do but look at these results:
$ tr 'Þ' ';' < file
0370;;A020;;4000011600;;RED USADO;;0,00;;20190414
0370;;A020;;4000011601;;RED USADO;;0,00;;20190414
0370;;A020;;4000011602;;RED USADO;;0,00;;20190414
$ sed 's/Þ/;/g' < file
0370;A020;4000011600;RED USADO;0,00;20190414
0370;A020;4000011601;RED USADO;0,00;20190414
0370;A020;4000011602;RED USADO;0,00;20190414
tr seems to treat every Þ as 2 characters. GNU tr works on bytes, and Þ is two bytes in UTF-8 (0xC3 0x9E), so each byte gets mapped to a ;. sed may see the same two bytes, but while tr is converting a set of chars to a set of chars, sed is converting a regexp to a string, so even if it considers Þ to be 2 characters wide it'll still do what you want. So just an interesting warning about trying to use tr to replace non-ASCII characters - YMMV!
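If you want to check the byte count yourself, something like the following should do. It's a sketch that assumes the file is UTF-8; the octal escapes \303\236 are the UTF-8 bytes of Þ, and thorn, nbytes, line and fixed are just illustrative names.

```shell
# Build Þ from its UTF-8 bytes so the script does not depend on the
# encoding of the script file itself.
thorn=$(printf '\303\236')

# Count the bytes in that one character (tr -d strips wc's padding).
nbytes=$(printf '%s' "$thorn" | wc -c | tr -d ' ')

# sed replaces the whole byte sequence with a single ";".
line="0370${thorn}A020"
fixed=$(printf '%s\n' "$line" | sed "s/${thorn}/;/g")

printf '%s bytes, result: %s\n' "$nbytes" "$fixed"
```

Because sed matches the two-byte sequence as a unit, you get one ; per Þ rather than two.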
If your data is in a file named 'd', try GNU sed:
sed -E 'y/Þ/;/' d
How to use Sed to replace all bold characters? (for example from 0200 to 0300)
the whole command is in one line
NSun0000-0000Mon0200+2130Tue0200+2130Wed0200+2130Thu0200+2130Fri0200+2130Sat0000-0000
This must be a universal command because the digits can change (but will always be in the same place).
Assuming you have Bash and you want to change each one to 0300.
ubuntu$ sed -E 's/([a-zA-Z]{3})([0-9]{4})/\10300/g' text.txt
NSun0300-0000Mon0300+2130Tue0300+2130Wed0300+2130Thu0300+2130Fri0300+2130Sat0300-0000
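Since the digits can change, it may be handier to keep the new value in a variable. A small sketch under that assumption; line and new are illustrative names, and only the day name needs a capture group because the old digits are thrown away.

```shell
line='NSun0000-0000Mon0200+2130Tue0200+2130Wed0200+2130Thu0200+2130Fri0200+2130Sat0000-0000'
new=0300

# \\1 inside double quotes reaches sed as \1 (the three-letter day name);
# the four digits that follow it are replaced by $new.
result=$(printf '%s\n' "$line" | sed -E "s/([A-Za-z]{3})[0-9]{4}/\\1${new}/g")

printf '%s\n' "$result"
```

The digits after + or - are untouched because they are not preceded by three letters.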
Regards!
I need to cut a number of characters from the beginning and end of a string. The string does not have a specific format and can be random numbers and words. I am trying to remove 5 characters from the beginning and 11 from the end of the string.
Input string:
342136001788006DEEFF0000060000806000006HSV40002HP
Output string:
6001788006DEEFF000006000080600000
The bolded characters 34213 and 6HSV40002HP are removed from the input.
It's OK, I found my answer using the cut command. I was so focused on awk & sed, but cut helps in the end:
cut -c6-38 test.txt
You found the cut command, which is the best solution in this case.
You wondered how you should do this with sed, which will be interesting for more complex situations.
The noob solution is (using ; for 2 different substitutions and $ for end-of-line):
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed 's/.....//;s/...........$//'
If you do not want to count the dots, you can tell how often a pattern repeats with pattern\{count\} (or pattern{count} with extended regular expressions).
And you can remember/recall a pattern with 's/..\(pattern\)../\1/'.
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed 's/.\{5\}\(.*\).\{11\}/\1/'
When your sed supports the flag -r (or -E), you can avoid all these backslashes:
echo '342136001788006DEEFF0000060000806000006HSV40002HP' |
sed -r 's/.{5}(.*).{11}/\1/'
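When the string is already in a shell variable, you may not need an external command at all. Here is a sketch with plain POSIX parameter expansion, where each ? matches exactly one character; s and trimmed are just illustrative names.

```shell
s='342136001788006DEEFF0000060000806000006HSV40002HP'

# "#" strips a matching prefix: five "?" remove the first 5 characters.
trimmed=${s#?????}

# "%" strips a matching suffix: eleven "?" remove the last 11 characters.
trimmed=${trimmed%???????????}

printf '%s\n' "$trimmed"
```

Counting question marks has the same drawback as counting dots, so for larger counts cut or the sed versions above are easier to read.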
Input:-
echo "1234ABC89,234" # A
echo "0520001DEF78,66" # B
echo "46545455KRJ21,00"
From the above strings, I need to split the characters to get the alphabetic field and the number after that.
From "1234ABC89,234", the output should be:
ABC
89,234
From "0520001DEF78,66", the output should be:
DEF
78,66
I have many strings that I need to split like this.
Here is my script so far:
echo "1234ABC89,234" | cut -d',' -f1
but it gives me 1234ABC89 which isn't what I want.
Assuming that you want to discard leading digits only, and that the letters will be all upper case, the following should work:
echo "1234ABC89,234" | sed 's/^[0-9]*\([A-Z]*\)\([0-9].*\)/\1\n\2/'
This works fine with GNU sed (I have 4.2.2), but other sed implementations might not like the \n, in which case you'll need to substitute something else.
Depending on the version of sed you can try:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1\n\2/'
or:
echo "0520001DEF78,66" | sed -E -e 's/[0-9]*([A-Z]*)([,0-9]*)/\1$\2/' | tr '$' '\n'
DEF
78,66
Explanation: the regular expression replaces the input with the expected output, except that it puts a "$" sign where the newline should go; the tr command then turns that "$" into a newline.
Where do the strings come from? Are they read from a file (or other source external to the script), or are they stored in the script? If they're in the script, you should simply reformat the data so it is easier to manage. Therefore, it is sensible to assume they come from an external data source such as a file or being piped to the script.
You could simply feed the data through sed:
sed 's/^[0-9]*\([A-Z]*\)/\1 /' |
while read -r alpha number
do
…process the two fields…
done
The only trick to watch there is that if you set variables in the loop, they won't necessarily be visible to the script after the done. There are ways around that problem — some of which depend on which shell you use. This much is the same in any derivative of the Bourne shell.
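One way around the variable problem that should work in any POSIX shell is to feed the loop from a here-document instead of a pipe, so the loop runs in the current shell. A sketch with the sample strings inlined; count and fields are made-up names standing in for whatever you set inside the loop.

```shell
count=0
fields=''

# The loop reads from a here-document, so it is not part of a pipeline
# and the variables it sets survive after "done".
while read -r alpha number
do
    count=$((count + 1))
    fields="${fields}${alpha}:${number};"
done <<EOF
$(printf '%s\n' 1234ABC89,234 0520001DEF78,66 46545455KRJ21,00 |
  sed 's/^[0-9]*\([A-Z]*\)/\1 /')
EOF

printf 'processed %s lines: %s\n' "$count" "$fields"
```

In a real script the printf inside the command substitution would be replaced by whatever produces the strings, for example cat on the input file.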
You said you have many strings like this, so I recommend if possible save them to a file such as input.txt:
1234ABC89,234
0520001DEF78,66
46545455KRJ21,00
On your command line, try this sed command reading input.txt as file argument:
$ sed -E 's/([[:digit:]]+)([[:alpha:]]{3})(.+)/\2\t\3/g' input.txt
ABC 89,234
DEF 78,66
KRJ 21,00
How it works
uses -E for extended regular expressions to save on typing, otherwise for example for grouping we would have to escape \(
uses grouping ( and ), searches three groups:
firstly digits, + specifies one-or-more of digits. Oddly using [0-9] results in an extra blank space above results, so use POSIX class [[:digit:]]
the next is to search for POSIX alphabetical characters, regardless if lowercase or uppercase, and {3} specifies to search for 3 of them
the last group searches for . meaning any character, + for one or more times
\2\t\3 then returns group 2 and group 3, with a tab separator
Thus you are able to extract two separate fields per line, just separated by tab, for easier manipulation later.
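As an example of that later manipulation, cut can pull the two fields apart, since tab is its default delimiter. This is a sketch for a single string: it passes a literal tab to sed through a variable because \t in the replacement is not understood by every sed, and tab, out, alpha and number are illustrative names.

```shell
# A literal tab character; command substitution only strips newlines.
tab=$(printf '\t')

# Same idea as the command above, with only the two wanted groups captured.
out=$(printf '%s\n' '1234ABC89,234' |
      sed -E "s/[[:digit:]]+([[:alpha:]]{3})(.+)/\\1${tab}\\2/")

# cut splits on tabs by default.
alpha=$(printf '%s\n' "$out" | cut -f1)
number=$(printf '%s\n' "$out" | cut -f2)

printf '%s\n%s\n' "$alpha" "$number"
```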