Shell: converting any number in a text file from decimal to hexadecimal

In a slight variation on another question, I would like to convert every number in a text file from decimal to hexadecimal.
A number is defined here as a maximal run of consecutive digit characters.
Example:
$ cat MyFile.txt
Hello,10,Good255Bye-boys01
Must become:
Hello,0A,GoodFFBye-boys01
Valid too:
Hello,A,GoodFFBye-boys1
Methods that allow specifying the field width (to obtain 0A instead of A, as in the first case) are preferred.
I have tried using grep to extract the numbers and piping them to bc to convert them:
( echo "obase=16" ; cat Line.txt | grep -o '[0-9]*') | bc
but this method outputs only the converted hex numbers, one per line, and discards the rest of the characters.

Since you're okay with using grep and bc in a pipe, it's clear that you don't want a solution in pure sh, but are happy to use external tools.
perl -pe 's/([0-9]+)/sprintf "%02X", $1/ge' myfile.txt
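Run on the sample file above, this should reproduce the preferred fixed-width output:
$ perl -pe 's/([0-9]+)/sprintf "%02X", $1/ge' MyFile.txt
Hello,0A,GoodFFBye-boys01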

A Python solution (thanks to the suggestion from user 4ae1e1):
$ cat convert.py
#!/usr/bin/env python3
import fileinput
import re
for line in fileinput.input():
    print(re.sub(r"\d+", lambda matchobj: "%X" % int(matchobj.group(0)), line), end="")
Example usage:
cat MyFile.txt | ./convert.py
or:
./convert.py MyFile.txt

The dec2hex command will also do the task, if you have it installed (it is not a standard utility).
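If you would rather avoid external tools entirely, here is a minimal pure-bash sketch of the same idea (an illustration, not taken from the answers above; it assumes bash for [[ ]] and BASH_REMATCH, and uses 10# so that numbers with leading zeros are still read as decimal):
while IFS= read -r line; do
    out=""
    # peel off "non-digits + digit-run" pairs from the front of the line
    while [[ $line =~ ^([^0-9]*)([0-9]+)(.*)$ ]]; do
        printf -v hex '%02X' "$((10#${BASH_REMATCH[2]}))"
        out+="${BASH_REMATCH[1]}${hex}"
        line=${BASH_REMATCH[3]}
    done
    printf '%s%s\n' "$out" "$line"
done < MyFile.txt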

Related

How to extract only the English words, leaving out the Devanagari words, in a bash script?

The text file is like this,
#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर#_
The desired text file should look like this:
#
1
8IU
underscore
$that
%redyellow
$
#_
This is what I have tried so far, using awk
awk -F"[अ-ह]*" '{print $1}' filename.txt
And the output that I am getting is,
#
1
$that
%red
$
and using awk -F"[अ-ह]*" '{print $1,$2}' filename.txt I get output like this:
#
1 े
ं
ो
$that
%red yellow
$ ि
ं
Is there any way to solve this in a bash script?
Using perl:
$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
#_
-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.
The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.
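For example, the inverted one-liner, which should keep only the Devanagari text, would look like:
perl -CSD -lpe 's/\P{Devanagari}+//g' input.txt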
Using awk you can do:
awk '{sub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow
See documentation: https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html
using [\x00-\x7F].
This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list [^\x00-\x7F] to match any single-byte characters that are not in the ASCII range.
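If a line can contain more than one non-ASCII run, using gsub instead of sub should remove all of them:
awk '{gsub(/[^\x00-\x7F]+/, "")} 1' file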
tr is a very good fit for this task:
LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt
LC_ALL=C sets the POSIX C locale, so that only the plain ASCII character set is treated as valid.
It then instructs tr to delete (-d) the complement (-c) of the [:cntrl:] and [:graph:] classes, i.e. everything that is neither a control character nor a visible (graphic) character. Since the locale is set to C, all non-ASCII bytes fall into that complement and are discarded.

Why does grep not work as expected with a large file? [duplicate]

grep returns
Binary file test.log matches
For example
echo "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in zsh
echo -e "line1 re \x00\r\nline2\r\nline3 re\r\n" > test.log # in bash
grep re test.log
I would like the result to show line1 and line3 (two lines in total).
Is it possible to use tr to convert the unprintable data into readable data, so that grep works again?
grep -a
It can't get simpler than that.
One way is to simply treat binary files as text anyway, with grep --text but this may well result in binary information being sent to your terminal. That's not really a good idea if you're running a terminal that interprets the output stream (such as VT/DEC or many others).
Alternatively, you can send your file through tr with the following command:
tr '\000-\011\013-\037\177-\377' '.' <test.log | grep whatever
This will change anything below the space character (except newline), and anything above 126, into a . character, leaving only the printable characters.
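Applied to the question's sample, this should then show both matching lines:
tr '\000-\011\013-\037\177-\377' '.' <test.log | grep re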
If you want every "illegal" character replaced by a different one, you can use something like the following C program, a classic standard input filter:
#include <stdio.h>

int main (void) {
    int ch;
    while ((ch = getchar()) != EOF) {
        if ((ch == '\n') || ((ch >= ' ') && (ch <= '~'))) {
            putchar (ch);
        } else {
            printf ("{{%02x}}", ch);
        }
    }
    return 0;
}
This will give you {{NN}}, where NN is the hex code for the character. You can simply adjust the printf for whatever style of output you want.
You can see the program in action here:
pax$ printf 'Hello,\tBob\nGoodbye, Bob\n' | ./filterProg
Hello,{{09}}Bob
Goodbye, Bob
You could run the data file through cat -v, e.g.:
$ cat -v tmp/test.log | grep re
line1 re ^#^M
line3 re^M
which could then be further post-processed to remove the junk; this is the closest analogue to your question about using tr for the task.
-v simply tells cat to display non-printing characters.
You can use "strings" to extract strings from a binary file, for example
strings binary.file | grep foo
You can force grep to look at binary files with:
grep --binary-files=text
You might also want to add -o (--only-matching) so you don't get tons of binary gibberish that will bork your terminal.
Starting with Grep 2.21, binary files are treated differently:
When searching binary data, grep now may treat non-text bytes as line
terminators. This can boost performance significantly.
So what happens now is that with binary data, all non-text bytes
(including newlines) are treated as line terminators. If you want to change this
behavior, you can:
use --text. This will ensure that only newlines are line terminators (see the example below)
use --null-data. This will ensure that only null bytes are line terminators
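For the question's sample, either of these should restore ordinary per-line matching:
grep --text re test.log
grep --binary-files=text re test.log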
grep -a will force grep to search and output from a file that grep thinks is binary.
grep -a re test.log
As James Selvakumar already said, grep -a does the trick. -a or --text forces grep to handle the input stream as text.
See Manpage http://unixhelp.ed.ac.uk/CGI/man-cgi?grep
try
cat test.log | grep -a somestring
you can do
strings test.log | grep -i re
This gives grep readable string output to search.
Here's what I used on a system that didn't have the "strings" command installed:
cat yourfilename | tr -cd "[:print:]"
This prints the text and removes unprintable characters in one fell swoop, unlike "cat -v filename" which requires some postprocessing to remove unwanted stuff. Note that some of the binary data may be printable so you'll still get some gibberish between the good stuff. I think strings removes this gibberish too if you can use that.
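Note that [:print:] does not include the newline, so the command above also joins all lines together; to keep the line structure for grep, keeping \n in the set as well should help:
tr -cd '[:print:]\n' < yourfilename | grep re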
You can also try the Word Extractor tool. Word Extractor can be used with any file on your computer to separate the strings that contain human text / words from binary code (exe applications, DLLs).

Unix - How to convert octal escape sequences via pipe

I'm pulling data from a file (in this case an exim mail log), and often it stores characters as an escaped octal sequence like \NNN, where 'N' represents an octal digit 0-7. This mainly happens when the subject is written in non-Latin characters (Arabic, for example).
My goal is to find the cleanest way to convert these octal characters to display correctly in my utf-8 enabled terminal, specifically in 'less' as there is the potential for lots of output.
The best approach I have found so far is as follows:
arbitrary_stream | { while read -r temp; do printf %b "$temp\n"; done } | less
This seems to work pretty well, but I would assume there is some translator tool, or maybe even a flag built into less, to handle this. I also found that if you use something like sed to inject a 0 after each \, you can store the result in a variable and then use echo -e "$data", but that was messier than the solution above.
Test case:
octalvar="\342\202\254"
expected output in less:
€
I'm looking for something cleaner, more complete or just better than my above solution in the form of either:
echo $octalvar | do_something | less
or
echo $octalvar | less --some_magic_flag
Any suggestions? Or is my solution about as clean as I can expect?
Conversion in GNU awk (which is needed for strtonum). It proved to be a hassle, so the code is a mess and could probably be streamlined; feel free to advise:
awk '{
    while (match($0, /\\[0-7]{3}/)) {     # search for \NNNs
        o = substr($0, RSTART, RLENGTH)   # extract it
        sub(/\\/, "0", o)                 # replace \ with 0 for strtonum
        c = sprintf("%c", strtonum(o))    # convert to a character
        sub(/\\[0-7]{3}/, c)              # replace the \NNN with the char
    }
}1' foo > bar
or paste the code between single quotes to a file above_program.awk and run it like awk -f above_program.awk foo > bar. Test file foo:
test 123 \342\202\254
Run it in a non-UTF-8 locale; I used the C locale:
$ locale
...
LC_ALL=C
$ awk -f above_program.awk foo
test 123 €
If you run it in a UTF-8 locale, each byte value is converted to a separate character and the output is garbled:
$ locale
...
LC_ALL=en_US.utf8
$ awk -f above_program.awk foo
test 123 â¬
This is my current version:
echo $arbitrary | { IFS=$'\n'; while read -r temp; do printf %b "$temp\n"; done; unset IFS; } | iconv -f utf-8 -t utf-8 -c | less
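A shorter sketch of the same idea (not from the answers above, and assuming the escapes always encode raw UTF-8 bytes) is to let perl decode each \NNN into the corresponding byte:
arbitrary_stream | perl -pe 's/\\([0-7]{3})/chr oct $1/ge' | less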

How can I shuffle the lines of a text file on the Unix command line or in a shell script?

I want to shuffle the lines of a text file randomly and create a new file. The file may have several thousands of lines.
How can I do that with cat, awk, cut, etc?
You can use shuf, on some systems at least (it doesn't appear to be in POSIX).
As jleedev pointed out, sort -R might also be an option, on some systems at least; well, you get the picture. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.
[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words: only with unique input lines / keys is it a true shuffle. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function - see manual.]
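For example, with GNU coreutils either of these should produce a shuffled copy (keeping the duplicate-line caveat above in mind for sort -R):
shuf myfile > myfile.shuffled
sort -R myfile > myfile.shuffled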
A Perl one-liner would be a simple version of Maxim's solution:
perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile
This answer complements the many great existing answers in the following ways:
The existing answers are packaged into flexible shell functions:
The functions take not only stdin input, but alternatively also filename arguments
The functions take extra steps to handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.
A performance comparison is made.
POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:
shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" |
sort -k1,1n | cut -d ' ' -f2-; }
Perl-based function - adapted from Moonyoung Kang's answer:
shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }
Python-based function, adapted from scai's answer:
shuf() { python -c '
import sys, random, fileinput; from signal import signal, SIGPIPE, SIG_DFL;
signal(SIGPIPE, SIG_DFL); lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write("".join(lines))
' "$@"; }
See the bottom section for a Windows version of this function.
Ruby-based function, adapted from hoffmanc's answer:
shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT");
puts ARGF.readlines.shuffle' "$@"; }
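Usage is the same for all of the functions above; for example (myfile.txt is just a placeholder name):
shuf myfile.txt
seq 1000 | shuf | head -3   # per the SIGPIPE note above, this should terminate quietly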
Performance comparison:
Note: These numbers were obtained on a late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OSX 10.10.3. While timings will vary with OS used, machine specs, awk implementation used (e.g., the BSD awk version used on OSX is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.
Input file is a 1-million-lines file produced with seq -f 'line %.0f' 1000000.
Times are listed in ascending order (fastest first):
shuf
0.090s
Ruby 2.0.0
0.289s
Perl 5.18.2
0.589s
Python
1.342s with Python 2.7.6; 2.407s(!) with Python 3.4.2
awk + sort + cut
3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4);
For further comparison, the solutions not packaged as functions above:
sort -R (not a true shuffle if there are duplicate input lines)
10.661s - allocating more memory doesn't seem to make a difference
Scala
24.229s
bash loops + sort
32.593s
Conclusions:
Use shuf, if you can - it's the fastest by far.
Ruby does well, followed by Perl.
Python is noticeably slower than Ruby and Perl, and, comparing Python versions, 2.7.6 is quite a bit faster than 3.4.2.
Use the POSIX-compliant awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
Stay away from sort -R, bash loops, and Scala.
Windows versions of the Python solution (the Python code is identical, except for variations in quoting and the removal of the signal-related statements, which aren't supported on Windows):
For PowerShell (in Windows PowerShell, you'll have to adjust $OutputEncoding if you want to send non-ASCII characters via the pipeline):
# Call as `shuf someFile.txt` or `Get-Content someFile.txt | shuf`
function shuf {
$Input | python -c @'
import sys, random, fileinput;
lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write(''.join(lines))
'@ $args
}
Note that PowerShell can natively shuffle via its Get-Random cmdlet (though performance may be a problem); e.g.:
Get-Content someFile.txt | Get-Random -Count ([int]::MaxValue)
For cmd.exe (a batch file):
Save to file shuf.cmd, for instance:
@echo off
python -c "import sys, random, fileinput; lines=[line for line in fileinput.input()]; random.shuffle(lines); sys.stdout.write(''.join(lines))" %*
I use a tiny perl script, which I call "unsort":
#!/usr/bin/perl
use List::Util 'shuffle';
@list = <STDIN>;
print shuffle(@list);
I've also got a NULL-delimited version, called "unsort0" ... handy for use with find -print0 and so on.
PS: Voted up 'shuf' too; I had no idea that was in coreutils these days... the above may still be useful if your system doesn't have 'shuf'.
Here is a first try that's easy on the coder but hard on the CPU: it prepends a random number to each line, sorts them, and then strips the random number from each line. In effect, the lines are sorted randomly:
cat myfile | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- > myfile.shuffled
here's an awk script
awk 'BEGIN{srand() }
{ lines[++d]=$0 }
END{
while (1){
if (e==d) {break}
RANDOM = int(1 + rand() * d)
if ( RANDOM in lines ){
print lines[RANDOM]
delete lines[RANDOM]
++e
}
}
}' file
output
$ cat file
1
2
3
4
5
6
7
8
9
10
$ ./shell.sh
7
5
10
9
6
8
2
1
3
4
A one-liner for python:
python -c "import random, sys; lines = open(sys.argv[1]).readlines(); random.shuffle(lines); print ''.join(lines)," myFile
And for printing just a single random line:
python -c "import random, sys; print random.choice(open(sys.argv[1]).readlines())," myFile
But see this post for the drawbacks of python's random.shuffle(). It won't work well with many (more than 2080) elements.
Simple awk-based function will do the job:
shuffle() {
awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
}
usage:
any_command | shuffle
This should work on almost any UNIX. Tested on Linux, Solaris and HP-UX.
Update:
Note that the leading zeros (%06d) and the rand() multiplication make it work properly also on systems where sort does not understand numbers: the keys can then be compared in lexicographical order (i.e. as a normal string compare).
Ruby FTW:
ls | ruby -e 'puts STDIN.readlines.shuffle'
A simple and intuitive way would be to use shuf.
Example:
Assume words.txt as:
the
an
linux
ubuntu
life
good
breeze
To shuffle the lines, do:
$ shuf words.txt
which writes the shuffled lines to standard output; so, to keep them, redirect to an output file, e.g.:
$ shuf words.txt > shuffled_words.txt
One such shuffle run could yield:
breeze
the
linux
an
ubuntu
good
life
A one-liner for Python based on scai's answer, but it a) takes stdin, b) makes the result repeatable with a seed, and c) picks out only 200 of all the lines.
$ cat file | python -c "import random, sys;
random.seed(100); print ''.join(random.sample(sys.stdin.readlines(), 200))," \
> 200lines.txt
We have a package to do the very job:
sudo apt-get install randomize-lines
Example:
Create an ordered list of numbers, and save it to 1000.txt:
seq 1000 > 1000.txt
to shuffle it, simply use
rl 1000.txt
If, like me, you came here looking for an alternative to shuf for macOS, then use randomize-lines.
Install the randomize-lines package (Homebrew), which provides an rl command with functionality similar to shuf.
brew install randomize-lines
Usage: rl [OPTION]... [FILE]...
Randomize the lines of a file (or stdin).
-c, --count=N select N lines from the file
-r, --reselect lines may be selected multiple times
-o, --output=FILE
send output to file
-d, --delimiter=DELIM
specify line delimiter (one character)
-0, --null set line delimiter to null character
(useful with find -print0)
-n, --line-number
print line number with output lines
-q, --quiet, --silent
do not output any errors or warnings
-h, --help display this help and exit
-V, --version output version information and exit
This is a python script that I saved as rand.py in my home folder:
#!/bin/python
import sys
import random
if __name__ == '__main__':
    with open(sys.argv[1], 'r') as f:
        flist = f.readlines()
    random.shuffle(flist)
    for line in flist:
        print line.strip()
On Mac OS X, sort -R and shuf are not available, so you can alias this in your .bash_profile as:
alias shuf='python rand.py'
If you have Scala installed, here's a one-liner to shuffle the input:
ls -1 | scala -e 'for (l <- util.Random.shuffle(io.Source.stdin.getLines.toList)) println(l)'
This bash function has minimal dependencies (only sort and bash):
shuf() {
    while IFS= read -r x; do
        echo "$RANDOM"$'\x1f'"$x"
    done | sort |
    while IFS=$'\x1f' read -r x y; do
        echo "$y"
    done
}
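For example, this should print the numbers 1 to 10 in random order:
seq 10 | shuf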
On Windows, you may try this batch file to help you shuffle your data.txt. Usage:
C:\> type list.txt | shuffle.bat > maclist_temp.txt
After issuing this command, maclist_temp.txt will contain a randomized list of lines.
Hope this helps.
Not mentioned as of yet:
The unsort util. Syntax (somewhat playlist oriented):
unsort [-hvrpncmMsz0l] [--help] [--version] [--random] [--heuristic]
[--identity] [--filenames[=profile]] [--separator sep] [--concatenate]
[--merge] [--merge-random] [--seed integer] [--zero-terminated] [--null]
[--linefeed] [file ...]
msort can shuffle by line, but it's usually overkill:
seq 10 | msort -jq -b -l -n 1 -c r
Another awk variant:
#!/usr/bin/awk -f
# usage:
# awk -f randomize_lines.awk lines.txt
# usage after "chmod +x randomize_lines.awk":
# randomize_lines.awk lines.txt
BEGIN {
FS = "\n";
srand();
}
{
lines[ rand()] = $0;
}
END {
for( k in lines ){
print lines[k];
}
}
