This question already has answers here:
Bash script to convert from HTML entities to characters
(12 answers)
Closed 4 years ago.
I am scraping a website with curl and parsing out what I need.
The URLs are returned with ASCII-encoded characters like
GET v2.12/...?fields=&#123;fieldname_of_type_Tab&#125; HTTP/1.1
How can I convert this to UTF-8 (char) directly from the command line (ideally something I can pipe | to) so that the result is...
GET v2.12/...?fields={fieldname_of_type_Tab} HTTP/1.1
EDIT: There are a number of solutions with sed, but the regex that goes along with them is quite ugly. Since the provided answer leveraging perl is very clean, I hope we can leave this question open.
Those are HTML entities.
Decode them like this using perl:
$ echo 'http://domain.tld/?fields=&#123;fieldname_of_type_Tab&#125;' |
perl -MHTML::Entities -pe 'decode_entities($_)'
Output:
http://domain.tld/?fields={fieldname_of_type_Tab}
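Since the goal is something pipeable, the same one-liner drops straight into a scraping pipeline; the URL below is a stand-in for the real endpoint:
curl -s 'https://example.com/page' | perl -MHTML::Entities -pe 'decode_entities($_)'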
This question already has answers here:
What is the best regular expression to check if a string is a valid URL?
(62 answers)
Closed 3 months ago.
Suppose there is a text file test.txt. It contains text and links to resources such as https://example.com/kqodbjcuic49w95rofwjue. How can I extract only the list of these links from there? (preferably via bash, but not required)
I tried this solution:
sed 's/^.*href="\([^"]*\).*$/\1/'
But it didn't help me.
grep -o "((?:(?:http|ftp|ws)s?|sftp):\/\/?)?([^:/\s.#?]+\.[^:/\s#?]+|localhost)(:\d+)?((?:\/\w+)*\/)?([\w\-.]+[^#?\s]+)?([^#]+)?(#[\w-]*)?" test.txt
will display all URLs inside the file.
(The regex comes from BSimjoo's link)
Grep text files guide at https://www.linode.com/docs/guides/how-to-grep-for-text-in-files/
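If your grep lacks PCRE support (-P), a much simpler ERE is often enough for plain http(s) links, at the cost of precision; treat it as a rough sketch rather than a validating regex:
grep -oE 'https?://[^[:space:]]+' test.txt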
This question already has answers here:
Are shell scripts sensitive to encoding and line endings?
(14 answers)
grep not showing result which read id from file
(2 answers)
Closed 12 months ago.
My small file contains this information line by line:
abc.123
abc.258
abc.952
I want to get the matching lines from my bigger file (~30 GB). I tried this command, but it didn't give me any result:
grep -f small.txt big.txt
I have verified that abc.123, abc.258 and abc.952 all exist in my bigger file: when I grep for each of these names one by one, it gives me exactly the result I want.
grep "abc.123" big.txt
I have no idea where I could possibly be going wrong.
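Given the duplicate targets above, a likely culprit is Windows (CRLF) line endings in small.txt: each pattern would then end in an invisible \r that never matches. A quick check and, assuming that is the cause, a fix:
file small.txt                  # reports "CRLF line terminators" if \r is present
cat -A small.txt                # with GNU cat, CRLF lines end in ^M$
tr -d '\r' < small.txt > small_unix.txt
grep -Ff small_unix.txt big.txt
As a bonus, -F treats the patterns as fixed strings, so the dot in abc.123 no longer matches any character.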
This question already has answers here:
What is character encoding and why should I bother with it
(4 answers)
Closed 2 years ago.
I'm trying to do the following:
LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt
The code is working but, when I open the new file, the replacement shows up as an extra character: "¦" instead of "¦". Why is that?
When you typed
LC_CTYPE=C sed 's/|/¦/g' t.txt > new_t.txt
your shell was probably configured to accept the command itself as UTF-8, and so in fact you ended up converting the single byte 0x7C (U+007C) to the two bytes 0xC2 0xA6 which is the correct UTF-8 encoding for U+00A6.
What you then did is unclear, but somehow you ended up examining the file in some encoding other than UTF-8 (most likely Latin-1, where the bytes 0xC2 0xA6 render as "¦"), which exposes the two bytes as the string you report seeing.
The correct workaround is to examine the file in a correctly configured program which supports UTF-8.
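To see the bytes for yourself (assuming a UTF-8 terminal), dump the replacement character with od:
$ printf '¦' | od -An -tx1
 c2 a6
A Latin-1 viewer displays exactly those two bytes as "¦".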
This question already has answers here:
What's the most robust way to efficiently parse CSV using awk?
(6 answers)
Closed 4 years ago.
I have a file like this:
col1×col2×col3
12×"Some field with "quotes" inside it"×"Some field without quotes inside but with new lines \n"
And I would like to replace the interior double quotes with single quotes so the result will look like this:
col1×col2×col3
12×"Some field with 'quotes' inside it"×"Some field without quotes inside but with new lines \n"
I guess this can be done with sed, awk or ex, but I haven't been able to figure out a clean and quick way of doing it. Real CSV files are on the order of millions of lines.
The preferred solution would be a one-liner using the aforementioned programs.
A simple workaround using sed, based on your field separator ×, could be:
sed -E "s/([^×])\"([^×])/\1'\2/g" file
This replaces each " that is preceded and followed by a character other than × with '.
Note that sed does not support lookaround assertions, so we have to group and reinsert the surrounding characters.
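A quick sanity check on one sample line; the outer, field-delimiting quotes survive because each one sits next to a × or at an end of the line:
$ printf '12×"Some field with "quotes" inside it"\n' | sed -E "s/([^×])\"([^×])/\1'\2/g"
12×"Some field with 'quotes' inside it"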
This question already has an answer here:
Why is a shell script giving syntax errors when the same code works elsewhere? [duplicate]
(1 answer)
Closed 5 years ago.
I've been looking for a solution to my problem all morning, especially in the four posts on https://stackoverflow.com with the same error name in their title, but the solutions don't work for me.
I want to make several simple cURL requests, put together in a Bash script. The request at the end of the file always works, whatever request it is; however, the requests before it return an error:
curl: (3) Illegal characters found in URL
I am pretty sure it has something to do with carriage returns in my file, but I don't know how to deal with them. As shown in the screenshot below, I tried ${url1%?}; I also tried ${url1%$'\r'}, but neither changes anything.
(Screenshot of the file and terminal output not reproduced here.)
Any ideas?
If your lines end with \r, stripping away the \r from the $url won't work, because the line
curl -o NUL "{url1%?}
also ends with a \r, which is appended to the url argument again.
Comment out the \r, that is
url1="www.domain.tld/file"
curl -o NUL "${url1%?}" #
or
url1="www.domain.tld/file" #
curl -o NUL "$url1" #
or convert the file before executing it
tr -d '\r' < test.sh > testWithoutR.sh
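To verify the carriage returns are actually there before converting, GNU cat's -A flag shows each CRLF-terminated line ending in ^M$:
$ cat -A test.sh
url1="www.domain.tld/file"^M$
curl -o NUL "$url1"^M$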