Use sed to extract an ASCII hex string from a file - bash

I have a file that looks like this:
$ some random
$ text
00ab2c3f03$ and more
random text
1a2bf04$ more text
blah blah
and the code that looks like this:
sed -ne 's/\(.*\)$ and.*/\1/p' "file.txt" > "output1.txt"
sed -ne 's/\(.*\)$ more.*/\1/p' "file.txt" > "output2.txt"
That gives me 00ab2c3f03 and 1a2bf04.
So it extracts everything from the beginning of the line up to the shell prompt and stores it in a file, twice for the two different instances.
The problem is that the file sometimes looks like this:
/dir # some random
/dir # text
00ab2c3f03/dir # and more
random text
345fabd0067234234/dir # more text
blah blah
And I want to make a universal extractor that either:
extracts data from the beginning of the line up to the '$' OR '/' character
intelligently extracts an arbitrary run of hex digits from the beginning of the line up to the first non-hex character
But I'm not good enough with sed to come up with an easy solution myself...

I think you want output like this:
$ cat file
$ some random
$ text
00ab2c3f03$ and more
random text
1a2bf04$ more text
blah blah
/dir # some random
/dir # text
00ab2c3f03/dir # and more
random text
345fabd0067234234/dir # more text
blah blah
$ sed -ne 's/\([a-f0-9]*\).* and more.*/\1/p' file
00ab2c3f03
00ab2c3f03
$ sed -ne 's/\([a-f0-9]*\).* more text.*/\1/p' file
1a2bf04
345fabd0067234234
You could also try the GNU sed command below. Because / is present in your input, I changed the sed delimiter to ~:
$ sed -nr 's~([a-f0-9]*)\/*\$*.* and more.*~\1~p' file
00ab2c3f03
00ab2c3f03
$ sed -nr 's~([a-f0-9]*)\/*\$*.* more text.*~\1~p' file
1a2bf04
345fabd0067234234
Explanation:
([a-f0-9]*) - Captures all the hex digits and stores them in a group.
The OP said there may be a / or $ symbol right after the hex digits, so the regex needs \/*\$* (/ zero or more times, $ zero or more times) after the capturing group.
The first command only works on lines which contain the string "and more".
The second one only works on lines which contain "more text", because the OP wants the two outputs in two different files.

This seems better to me:
sed -nr 's#([[:xdigit:]]+)[$/].*#\1#p' file
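Run against the combined sample file above, this should print every hex prefix in order, regardless of which prompt style follows it:
$ sed -nr 's#([[:xdigit:]]+)[$/].*#\1#p' file
00ab2c3f03
1a2bf04
00ab2c3f03
345fabd0067234234
To keep the two-file split, anchor each invocation on its marker text as before, e.g. sed -nr 's#([[:xdigit:]]+)[$/].* and more.*#\1#p' file > output1.txt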

Related

How to extract only the English words, leaving out the Devanagari words, in a bash script?

The text file is like this,
#एक
1के
अंकगणित8IU
अधोरेखाunderscore
$thatऔर
%redएकyellow
$चिह्न
अंडरस्कोर#_
The desired text file should be like,
#
1
8IU
underscore
$that
%redyellow
$
#_
This is what I have tried so far, using awk
awk -F"[अ-ह]*" '{print $1}' filename.txt
And the output that I am getting is,
#
1
$that
%red
$
Using awk -F"[अ-ह]*" '{print $1,$2}' filename.txt instead, I get output like this:
#
1 े
ं
ो
$that
%red yellow
$ ि
ं
Is there any way to solve this in a bash script?
Using perl:
$ perl -CSD -lpe 's/\p{Devanagari}+//g' input.txt
#
1
8IU
underscore
$that
%redyellow
$
#_
-CSD tells perl that standard streams and any opened files are encoded in UTF-8. -p loops over input files printing each line to standard output after executing the script given by -e. If you want to modify the file in place, add the -i option.
The regular expression matches any codepoints assigned to the Devanagari script in the Unicode standard and removes them. Use \P{Devanagari} to do the opposite and remove the non-Devanagari characters.
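For example, inverting the class with \P{Devanagari} keeps only the Devanagari text from the same sample file:
$ perl -CSD -lpe 's/\P{Devanagari}+//g' input.txt
एक
के
अंकगणित
अधोरेखा
और
एक
चिह्न
अंडरस्कोर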
Using awk you can do:
awk '{sub(/[^\x00-\x7F]+/, "")} 1' file
#
1
8IU
underscore
$that
%redyellow
$
#_
See documentation: https://www.gnu.org/software/gawk/manual/html_node/Bracket-Expressions.html
using [\x00-\x7F].
This matches all values numerically between zero and 127, which is the defined range of the ASCII character set. Use a complemented character list [^\x00-\x7F] to match any single-byte characters that are not in the ASCII range.
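Note that sub() replaces only the first run of non-ASCII characters on each line, which is enough for this input; if a line could contain several runs, gsub() is the safer choice. A hypothetical line with two Devanagari runs shows the difference:
$ echo 'redएकyellowकेgreen' | awk '{sub(/[^\x00-\x7F]+/, "")} 1'
redyellowकेgreen
$ echo 'redएकyellowकेgreen' | awk '{gsub(/[^\x00-\x7F]+/, "")} 1'
redyellowgreen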
tr is a very good fit for this task:
LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt
It sets the POSIX C locale so that only the ASCII character set is considered valid.
It then instructs tr to -d delete the -c complement of [:cntrl:][:graph:], the control and visible character classes; in other words, it deletes every character that is neither a control character nor a visible one. Since the locale is set to C, all non-ASCII characters are discarded.
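On the sample file this produces the desired output:
$ LC_ALL=C tr -c -d '[:cntrl:][:graph:]' < input.txt
#
1
8IU
underscore
$that
%redyellow
$
#_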

How to split a text file content by a string?

Suppose I've got a text file that consists of two parts separated by the delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest writing this in bash?
You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This would read the file twice, but handling text files in the shell, i.e. storing their content in variables, should generally be limited to small files. Given that your file is small, this shouldn't be a problem.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables, and read the file twice.
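A minimal sketch of that single-pass idea, writing each part straight to its own file (the part1.txt/part2.txt names are my own):
awk -v part=1 '
    /^---$/ { part++; next }          # bump the counter on each delimiter line
    { print > ("part" part ".txt") }  # everything else goes to the current part file
' file.txt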
A solution using sed:
foo=$(sed -n '/^---$/q;p' file.txt)
bar=$(sed -n '1,/^---$/b;p' file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;
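With the sample file, the two commands print the expected parts:
$ sed -n '/^---$/q;p' file.txt
aa
bbb
$ sed -n '1,/^---$/b;p' file.txt
cccc
dd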
Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If your coreutils version is >= 8.22, the --suppress-matched option can be used and the sed post-processing is not required:
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}"
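For example (foo_bar00 and foo_bar01 are csplit's default numbered suffixes):
$ csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}"
$ cat foo_bar00
aa
bbb
$ cat foo_bar01
cccc
dd
foo=$(cat foo_bar00)
bar=$(cat foo_bar01)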

How to add a header to text file in bash?

I have a text file and want to convert it to a CSV file. Before converting it, I want to add a header to the text file so that the CSV file has the same header. I have one thousand columns in the text file and want one thousand column names. As a side note, the content of the text file is just rows of numbers separated by commas ",". Is there any way to add the header line in bash?
I tried the way below and it didn't work. First I ran this in Python:
for i in range(1001):
    print "col" + "_" + str(i)
I saved the output of this to a text file (python header.py >> header.txt) and prepended it to my original text file like below:
cat header.txt filename.txt > newfilename.txt
then converted the txt file to a csv file with "mv newfilename.txt newfilename.csv".
But unfortunately this doesn't work, as the header line ends up with double the number of entries of the other rows for some reason. I would appreciate any help solving this problem.
Based on the description, your file is already comma-separated, so it is a CSV file. You just want to add a column-number header line.
$ awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1' file
will add as many column headers as there are fields in the first row of the file,
e.g.
$ seq 5 | paste -sd, | # create 1,2,3,4,5 as a test input
awk -F, 'NR==1{for(i=1;i<=NF;i++) printf "col_%d%s", i, (i==NF?ORS:FS)}1'
col_1,col_2,col_3,col_4,col_5
1,2,3,4,5
You can generate the column names in bash using one of the options below. Each example generates a header.txt file. You already have code to add this to the beginning of your file as a header.
Using bash loops
Bash loops for this many iterations will be inefficient, but will work.
for i in {1..10}; do
  echo -n "col_$i "
done > header.txt
echo >> header.txt
or using seq
for i in $(seq 1 1000); do
  echo -n "col_$i "
done > header.txt
echo >> header.txt
Using seq only
Using seq alone will be more efficient.
seq -f "col_%g" -s" " 1 1000 > header.txt
Use seq and sed
You can use the seq utility to construct your CSV header, with a little help from Bash expansions. You can then insert the new header row into your existing CSV file, or concatenate the header with your data.
For example:
# construct a quoted CSV header
columns=$(seq -f '"col_%g"' -s', ' 1 1001)
# note: seq only prints the separator between items, so there is no trailing comma to strip
# insert headers as first line of foo.csv with GNU sed
sed -i -e "1 i\\${columns}" /tmp/foo.csv
Caveats
If you don't have GNU sed, you can also use cat, sponge, or other tools to concatenate your header and data, although most of your concatenation options will require redirection to a new combined file to avoid clobbering your existing data.
For example, given /tmp/data.csv as your original data file:
seq -f '"col_%g"' -s', ' 1 1001 > /tmp/header.csv
sed -i -e 's/,[[:space:]]*$//' /tmp/header.csv
cat /tmp/header /tmp/data > /tmp/new_file.csv
Also, note that while Bash solutions that avoid calling standard utilities are possible, doing it in pure Bash might be too slow or memory intensive for large data sets.
Your mileage may vary.
printf "col%s," {1..100} |
sed 's/,$//' |
cat - filename.txt >newfilename.txt
I believe sed should supply the missing final newline as a side effect. If not, maybe try 's/,$/\n/' though this isn't entirely portable, either. You could probably replace the cat with sed as well, something like
... | sed 's/,$//;r filename.txt'
but again, I'm not entirely sure how portable this is.
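Scaled down to five columns, the header part of the pipeline produces (GNU sed shown; as noted, final-newline handling may vary):
$ printf "col%s," {1..5} | sed 's/,$//'
col1,col2,col3,col4,col5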

Use awk to extract value from a line

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?
awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000
sed -e 's/[a-zA-Z.<\/>=" -]//g' file
Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is a backreference to the captured group. When a pattern groups all or part of its content into a pair of parentheses, it captures that content and stores it temporarily in memory. For more details, please refer to https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and
The script solution with parameter expansion:
#!/bin/bash
while read -r line || test -n "$line" ; do
    value="${line%<*}"             # strip the closing tag, i.e. everything from the last '<'
    printf "%s\n" "${value##*\>}"  # strip everything up to the last remaining '>'
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000
Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.
Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string, if you don't want to accept an empty string please add the -r option to sed and replace \([.[:digit:]]*\) by ([.[:digit:]]+).
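For instance, the -r variant applied to a made-up line with a non-integer value:
$ printf '%s\n' '<ratio>3.14</ratio>' | sed -rn 's/[^>]*>([.[:digit:]]+)<.*/\1/p'
3.14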

How to apply two different sed commands on a line?

Q1:
I would like to edit a file containing a set of email ids such that all the domain names become generic.
Example,
peter#yahoo.com
peter#hotmail.co.in
philip#gmail.com
to
peter_yahoo#generic.com
peter_hotmail#generic.com
philip_gmail#generic.com
I used the following sed cmd to replace # with _
sed 's/#/_/' <filename>
Is there a way to append another sed cmd to the one mentioned above so that I can also replace the last part of the domain names with #generic.com?
Q2:
So how do I approach this if there is text after my domain names?
Example,
peter#yahoo.com,i am peter
peter#hotmail.co.in,i am also peter
To,
peter_yahoo.com#generic.com,i am peter
peter_hotmail.co.in#generic.com,i am also peter
I tried #(,) instead of #(.*), but it doesn't work, and I can't think of any other solution.
Q3:
Suppose if my example is like this,
peter#yahoo.com
peter#hotmail.co.in,i am peter
I want my result to be as follows,
peter_yahoo.com#generic.com
peter_hotmail.co.in#generic.com,i am peter,i am peter
How do I do this with a single sed cmd?
The following cmd results in,
sed -r 's!#(.*)!_\1#generic.com!' FILE
peter_yahoo.com#generic.com
peter_hotmail.co.in,i am peter,i am peter#generic.com
And the following cmd won't work on "peter#yahoo.com",
sed -r 's!#(.*)(,.*)!_\1#generic.com!' FILE
Thanks!!
Golfing =)
$ cat FILE
Example,
peter#yahoo.com
peter#hotmail.co.in
philip#gmail.com
$ sed -r 's!#(.*)!_\1#generic.com!' FILE
Example,
peter_yahoo.com#generic.com
peter_hotmail.co.in#generic.com
philip_gmail.com#generic.com
In reply to user1428900, here are some explanations:
sed -r # sed in extended regex mode
s # substitution
! # my delimiter, pick anything you want instead; not part of the regex
#(.*) # a literal "#" + capture of the rest of the line
! # middle delimiter
_\1#generic.com # an "_" + the captured group N°1 + "#generic.com"
! # end delimiter
FILE # file-name
Extended mode isn't really needed here; consider the same snippet in BRE (basic regex) mode:
sed 's!#\(.*\)!_\1#generic.com!' FILE
Edit to fit your new needs:
$ cat FILE
Example,
peter#yahoo.com,I am peter
peter#hotmail.co.in
philip#gmail.com
$ sed -r 's!#(.*),.*!_\1#generic.com!' FILE
Example,
peter_yahoo.com#generic.com
peter#hotmail.co.in
philip#gmail.com
If you want to work only on email lines, you can do something like this:
sed -r '/#/s!#(.*),.*!_\1#generic.com!' FILE
the /#/ part means the substitution only works on the lines containing the character #
Edit2:
If you want to keep the ends of the lines, as your new comments said:
sed -r 's!#(.*)(,.*)!_\1#generic.com\2!' FILE
You can run multiple commands with:
sed -e cmd1 -e cmd2
or
sed 'cmd1;cmd2'
So, in your case you could do:
sed -e 's/#/_/' -e 's/\..*/#generic.com/' filename
but it seems easier to do it in a single substitution:
sed 's/#\([^.]*\)\..*/_\1#generic.com/' filename
sed 's/\(.*\)#\([^.]*\)\..*/\1_\2#generic.com/'
Expression with escaped parentheses \(.*\) is used to remember portions of the regular expression. The "\1" is the first remembered pattern, and the "\2" is the second remembered pattern.
The expression \(.*\) before the # is used to remember the beginning of the email id (peter, peter, philip).
The expression \([^.]*\)\. after the # is used to remember the ending of the email id (yahoo, hotmail, gmail). In other words, it says: take everything between # and the first dot.
The expression .* at the end is used to match all trailing symbols in the e-mail id (com, co.in, com).
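Applied to a file containing just the three sample addresses, that prints:
$ sed 's/\(.*\)#\([^.]*\)\..*/\1_\2#generic.com/' FILE
peter_yahoo#generic.com
peter_hotmail#generic.com
philip_gmail#generic.com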
