This question already has answers here:
What is character encoding and why should I bother with it
(4 answers)
Closed 2 years ago.
I'm trying to do the following:
LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt
The command works, but when I open the new file, the replacement contains an additional character: "Â¦". Why is that?
When you typed
LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt
your shell was probably configured to accept the command itself as UTF-8, and so in fact you ended up converting the single byte 0x7C (U+007C) to the two bytes 0xC2 0xA6 which is the correct UTF-8 encoding for U+00A6.
What you then did is unclear, but somehow you ended up examining the file in some other encoding than UTF-8, which exposes the two bytes as the string you report seeing.
The correct workaround is to examine the file in a correctly configured program which supports UTF-8.
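The byte-level picture described above can be sketched as follows, assuming a POSIX shell with od and iconv available. The helper name hexbytes is made up for illustration. The sketch writes the two bytes 0xC2 0xA6 directly, dumps them, and then deliberately misreads them as ISO-8859-1, which is one encoding that would produce the extra character:

```shell
# hexbytes is a hypothetical helper: print stdin as one compact hex string.
hexbytes() { od -An -tx1 | tr -d ' \n'; echo; }

# The UTF-8 encoding of U+00A6 (BROKEN BAR), written as octal escapes.
printf '\302\246' | hexbytes          # prints c2a6

# Misread the same two bytes as ISO-8859-1 and re-encode them as UTF-8:
# each byte becomes its own character (U+00C2, then U+00A6).
printf '\302\246' | iconv -f ISO-8859-1 -t UTF-8 | hexbytes   # prints c382c2a6
```

Running od -An -tx1 on new_t.txt itself would show whether the file really contains c2 a6 where the pipe used to be.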
This question already has answers here:
Are shell scripts sensitive to encoding and line endings?
(14 answers)
grep not showing result which read id from file
(2 answers)
Closed 12 months ago.
My small file contains this information line by line:
abc.123
abc.258
abc.952
I want to get the lines in my bigger file (~30 GB) that match these entries. I tried this command, but it didn't give me any results.
grep -f small.txt big.txt
I have verified that abc.123, abc.258 and abc.952 all exist in my bigger file: when I grep for each of these names one by one, I get exactly the result I want.
grep "abc.123" big.txt
I have no idea where I could be going wrong.
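The question does not say how small.txt was created. One possible cause, in line with the linked duplicate about line endings, is that the pattern file has Windows (CRLF) line endings, so each pattern ends in an invisible carriage return that never occurs in big.txt. The sketch below reproduces that situation with throwaway files (the directory and sample contents are made up) and strips the carriage returns with tr. It also adds -F, on the assumption that the IDs are meant as fixed strings, so the dot is not treated as a regex wildcard.

```shell
# Throwaway reproduction; the file contents are illustrative only.
demo=$(mktemp -d)
printf 'abc.123\r\nabc.258\r\n' > "$demo/small.txt"        # CRLF pattern file
printf 'id=abc.123 ok\nid=abc.999 no\n' > "$demo/big.txt"

# Count matching lines using the CRLF patterns
# (grep exits non-zero when nothing matches, hence "|| true").
grep -c -f "$demo/small.txt" "$demo/big.txt" || true

# Strip the carriage returns, then match the patterns as fixed strings.
tr -d '\r' < "$demo/small.txt" > "$demo/small_unix.txt"
grep -F -f "$demo/small_unix.txt" "$demo/big.txt"
```

To check whether this applies to the real file, od -c small.txt shows any \r characters explicitly.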
This question already has answers here:
Using different delimiters in sed commands and range addresses
(3 answers)
Closed 12 months ago.
I have many repositories with the remote origin set to HTTPS. Now I want to change all origin remotes to SSH.
I am using a command for this in which I want to replace every leading url = https://gitlab.mypath/ with git#gitlab.mypath:.
Is there a way to express this with one sed call? Something like:
's#https://gitlab.mypath/#git#gitlab.mypath:#g'
I need to be able to escape the first "#g" (the one in git#gitlab).
The correct answers
Switch from s### to s|||
Apart from choosing any character for the delimiter, you can also escape that character: \#
The explanation
The character after the s specifies the separator, which must occur three times in your s command.
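Both suggestions can be sketched as follows, assuming a POSIX sed. The sample input line, including the group/repo.git path, is made up for illustration.

```shell
# Sample config line; the repository path is hypothetical.
line='url = https://gitlab.mypath/group/repo.git'

# Option 1: use | as the delimiter, so the # needs no escaping.
printf '%s\n' "$line" | sed 's|https://gitlab.mypath/|git#gitlab.mypath:|g'

# Option 2: keep # as the delimiter and escape the literal # with a backslash.
printf '%s\n' "$line" | sed 's#https://gitlab.mypath/#git\#gitlab.mypath:#g'
```

Both commands print url = git#gitlab.mypath:group/repo.git.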
Thanks
#Cyrus #dan
This question already has answers here:
Bash script to convert from HTML entities to characters
(12 answers)
Closed 4 years ago.
I am scraping a website with curl and parsing out what I need.
The URLs are returned with ASCII-encoded characters like
GET v2.12/...?fields=&#123;fieldname_of_type_Tab&#125; HTTP/1.1
How can I convert this to UTF-8 (char) directly from the command line (ideally something I can pipe | to) so that the result is...
GET v2.12/...?fields={fieldname_of_type_Tab} HTTP/1.1
EDIT: There are a number of solutions with sed, but the regex that goes along with them is quite ugly. Since the provided answer leveraging perl is very clean, I hope we can leave this question open.
Those are HTML entities.
Decode them like this using perl:
$ echo 'http://domain.tld/?fields=&#123;fieldname_of_type_Tab&#125;' |
perl -MHTML::Entities -pe 'decode_entities($_)'
Output:
http://domain.tld/?fields={fieldname_of_type_Tab}
I have a script which is reading some data from one server and storing it in a file. But the file seems somehow corrupt. I can print it to the display, but checking the file with file produces
bash$ file -I filename
filename: text/plain; charset=unknown-8bit
Why is it telling me that the encoding is unknown? The first line of the file displays for me as
“The Galaxy A5 and A3 offer a beautifully crafted full metal unibody
A hex dump reveals that the first three bytes are 0xE2, 0x80, 0x9C followed by the regular ASCII text The Galaxy A5...
What's wrong? Why does file tell me the encoding is unknown, and what is it actually?
Based on the information in the question, the file is a perfectly good UTF-8 file. The first three bytes encode LEFT DOUBLE QUOTATION MARK (U+201C) aka a curly quote.
Maybe your version of file is really old.
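One way to double-check this without relying on file is to let iconv do the decoding. The sketch below assumes iconv and od are available; the helper name is_valid_utf8 is made up.

```shell
# is_valid_utf8 is a hypothetical helper: succeeds if the whole file decodes as UTF-8.
is_valid_utf8() { iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; }

# The three bytes from the hex dump, converted to UTF-16BE to expose the code point.
# The output is the two bytes 20 1c, i.e. U+201C.
printf '\342\200\234' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1
```

With that in place, is_valid_utf8 filename && echo "valid UTF-8" reports whether the whole file decodes cleanly, not just its first line.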
You can use iconv to convert the file into the desired charset, e.g.:
iconv --from-code=UTF-8 --to-code=YOURTARGET filename
To get a list of supported targets, use the --list flag.
This question already has answers here:
RE error: illegal byte sequence on Mac OS X
(7 answers)
Closed 7 years ago.
Doing some stream editing to change the nasty Parallels icon. It's poorly developed and embedded into the app itself rather than being an image file. So I've located this sed command that has some good feedback:
sudo sed -i.bak s/Parallels_Desktop_Overlay_128/Parallels_Desktop_Overlay_000/g /Applications/Parallels\ Desktop.app/Contents/MacOS/prl_client_app
It returns sed: RE error: illegal byte sequence
Can anyone explain what this means? What part of the command is the problem?
Try setting the LANG environment variable (LANG=C sed ...) or use one of the binary sed tools mentioned here: binary sed replacement
Why the error?
Without LANG=C, sed assumes that files are encoded in whatever encoding LANG specifies. The file, being binary, may contain bytes that are not valid characters in that encoding, hence the 'illegal byte sequence' error.
Why does LANG=C work?
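In the C locale, sed treats its input as a sequence of single bytes: ASCII characters are themselves, and non-ASCII bytes are passed through as literals, so there is no multi-byte decoding step that could fail.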
LANG=C alone didn't do the trick for me but adding LC_CTYPE=C as well solved it.
In addition to LANG=C and LC_CTYPE=C, I had to do LC_ALL=C to get this to work.
LC_ALL overrides all individual LC_* categories. Thus, the most robust approach is to use LC_ALL=C sed ... - no need to also deal with the other variables.
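A small sketch of that recommendation on made-up data: the input contains the byte 0xFF, which is never valid in UTF-8, followed by the string to replace. The helper name replace_in_bytes is hypothetical, and it assumes the old and new strings contain no slashes or regex metacharacters.

```shell
# replace_in_bytes OLD NEW: substitute on stdin with every locale category
# forced to C, so sed works byte by byte and has no multi-byte sequence to reject.
replace_in_bytes() { LC_ALL=C sed "s/$1/$2/g"; }

printf 'x\377Parallels_Desktop_Overlay_128\n' |
  replace_in_bytes Parallels_Desktop_Overlay_128 Parallels_Desktop_Overlay_000 |
  od -c
```

The 0xFF byte comes through unchanged, and because the replacement has the same length as the original string, the total size of the data stays the same.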
I managed to do it by running:
unset LANG
before the sed command.
Not sure what I've done or why it works but it did.