How to change characters in a file? - bash

I have a file that contains many characters. I need to count how many times each character
is shown in the file (The file contains more than one " " between each word).
I figured that the best way to do so is using tr -s " " "/n"
and then using sort. That way I can easily use egrep -c to count the characters.
But how do I use the tr command properly?
I seem to be unable to use it and put it into a variable.

The easiest implementation would probably be to add a \n after each char,
then to sort them and count them:
$ cat file
foo bar baz.
$ sed 's/./&\n/g' file | sort | uniq -c
1
2
1 .
2 a
2 b
1 f
2 o
1 r
1 z
You can probably do something like that with bash's associative arrays, but it would be tricky and you couldn't count \0 characters anyway.

Using sed with extended regular expressions may help, if I understood your problem correctly:
sed -r 's/(.){1}/\1\n/g' your_file.txt | sort | uniq -c
This tells sed to capture any single character with a regexp group (the (.){1} part), substitute it with the group (\1), and append \n so that there is one character per line. Next, sort and uniq -c do the counting. This counts every character, including spaces and non-printable ones; to count only visible characters, change the sed script:
sed -r 's/[^[:graph:]]//g;s/([[:graph:]]){1}/\1\n/g' your_file.txt | sort | uniq -c
It first deletes everything that is not a visible character (spaces included, since [:graph:] excludes them) and then replaces each remaining character with itself plus \n.
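For plain ASCII input, the same count can be sketched without relying on \n in the sed replacement. This is only an illustration: char_freq is a hypothetical helper name, and because fold -w 1 may count bytes rather than characters in some implementations, single-byte input is assumed.

```shell
# char_freq: read text on stdin and print "count char" for every visible
# (non-space, printable) character, most frequent first.
# Assumption: single-byte (ASCII) input.
char_freq() {
    tr -cd '[:graph:]' |   # delete everything that is not a visible character
    fold -w 1 |            # put each remaining character on its own line
    sort |                 # group identical characters together
    uniq -c |              # count each group
    sort -rn               # highest count first
}

printf 'foo bar baz.\n' | char_freq
```

The final sort -rn is optional; it only orders the result by frequency.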

Related

Counting number of different words in a txt file in Bash

Well, I do not know much about programming in bash. I'm new to it, so I'm struggling to find code that iterates over all the lines in a txt file and counts how many different words there are.
Example: if a txt file contains "Nory was a Catholic because her mother was a Catholic"
then the result must be 7.
$ grep -o '[^[:space:]]*' file | sort -u | wc -l
7
Sure. I assume you are ok with defining "words" as things that are separated by space? In which case, try something like this:
cat filename | sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" | sort -u | wc -l
This command says:
Dump contents of filename
Replace multiple spaces with a single space
Replace spaces with newline
Sort and "uniquify" the list
Print out the count of lines
Per the comment, you can technically get away without using cat if you'd like, with the following:
sed -r -e "s/[ ]+/ /g" -e "s/ /\n/g" filename | sort -u | wc -l
Further, from another comment, you could optionally use tr (importantly with its -s flag to handle repeated spaces) instead of sed with something like:
tr -s " " "\n" < filename | sort -u | wc -l
The moral of the story is there are several ways this kind of thing can be accomplished, not to mention the other full answers that are given here :-) My personal favorite answer at this point is Ed Morton's which I've upvoted accordingly.
You could also lowercase the text so that words compare equal regardless of casing.
Also filter words with the [:alnum:] character class, rather than [a-zA-Z0-9_] that is only valid for US-ASCII, and will fail dramatically with Greek or Turkish.
#!/usr/bin/env bash
echo "The uniq words are the words that appears at least once, regardless of casing." |
# Turn text to lowercase
tr '[:upper:]' '[:lower:]' |
# Split alphanumeric with newlines
tr -sc '[:alnum:]' '\n' |
# Sort uniq words
sort -u |
# Count lines of unique words
wc -l
I would do it like so, with comments:
echo "Nory was a Catholic because her mother was a Catholic" |
# tr replace
# -s - squeeze
# -c - complementary
# [a-zA-Z0-9_] - all letters, number and underscore
# but complementary set, so all non letters, not numbers and not underscores.
# replace them by newline
tr -sc '[a-zA-Z0-9_]' '\n' |
# and sort unique and display count
sort -u | wc -l
Tested on repl bash.
Decided to use [a-zA-Z0-9_], because this is how GNU sed \w extension matches a word.
cat yourfile.txt | xargs -n1 | sort | uniq -c > youroutputfile.txt
xargs -n1 = put one word per line
sort = sorts
uniq -c = counts occurrences of distinct values
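One caveat: xargs interprets quotes and backslashes in its input, so text containing an apostrophe can make it fail. A minimal sketch of the same count using tr instead, where word_freq is a hypothetical helper name and "word" is assumed to mean a run of non-whitespace characters:

```shell
# word_freq: read text on stdin and print "count word" for each distinct
# word, most frequent first.
# Assumption: words are runs of non-whitespace characters.
word_freq() {
    tr -s '[:space:]' '\n' |  # one word per line, squeezing repeated whitespace
    grep -v '^$' |            # drop the empty line left by leading whitespace
    sort |                    # group identical words together
    uniq -c |                 # count each group
    sort -rn                  # highest count first
}

printf 'Nory was a Catholic because her mother was a Catholic\n' | word_freq
```

Counting the lines of this output (wc -l) gives the number of distinct words.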

Bash - Read in a file and replace multiple spaces with just one comma

I'm trying to write a bash script that will take in a file with spaces and output the same file, but comma delimited. I figured out how to replace spaces with commas, but I've run into a problem: there are some rows that have a variable number of spaces. Some rows contain 2 or 3 spaces and some contain as many as 7 or 13. Here's what I have so far:
sed 's/ /,/g' $varfile > testdone.txt
$varfile is the file name that the user gives.
But I'm not sure how to fix the variable space problem. Any suggestions are welcome. Thank you.
This is not a job for sed. tr is more appropriate:
$ printf 'foo bar\n' | tr -s ' ' ,
foo,bar
The -s tells tr to squash multiple occurrences. Also, you can generalize with tr -s '[:space:]' , (which will replace newlines, perhaps undesirable) or tr -s ' \t' , to handle spaces or tabs.
You just need to use the + quantifier to match one or more
Assuming GNU sed
sed 's/ \+/,/g' file
# or
sed -E 's/ +/,/g' file
With GNU basic regular expressions, the "one or more" quantifier is \+
With GNU extended regular expressions, the "one or more" quantifier is +
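Both the tr -s and the + approaches turn leading or trailing blanks into a comma, which produces an empty first or last field. A minimal sketch that trims them first, assuming a sed that supports -E (GNU and BSD sed do); spaces_to_commas is a hypothetical helper name:

```shell
# spaces_to_commas: read lines on stdin and write them comma-delimited.
# Leading and trailing blanks (spaces or tabs) are trimmed first so that
# they do not become empty fields.
# Assumption: sed supports -E for extended regular expressions.
spaces_to_commas() {
    sed -E \
        -e 's/^[[:blank:]]+//' \
        -e 's/[[:blank:]]+$//' \
        -e 's/[[:blank:]]+/,/g'
}

printf '  foo   bar \t baz  \n' | spaces_to_commas
```

In the script from the question this could be used as spaces_to_commas < "$varfile" > testdone.txt.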

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure out how to tell sed to make dot match a newline:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
instead I get original:
one
two
three
sed is a line-based tool. I don't think there is an option.
You can use h/H(hold), g/G(get).
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, you can match any character, including line break characters, with a plain dot. The manual describes . as follows:
"Matches any character, including newline."
All you need to use is a -z option:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0 option sets the record separator to the NUL character, so a text file without NUL bytes is read as a whole rather than line by line. -i enables in-place file modification, the s flag makes . match any character including line break characters, and .*? matches any 0 or more characters, as few as possible, because *? is non-greedy. The -CSD -Mutf8 part makes sure your input is decoded and your output is re-encoded correctly.
You can use Python this way:
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print(s, end="")'
one
three
$
This reads all of Python's standard input (sys.stdin.read()), then substitutes "one" for "one.*two" with the dot-matches-all setting enabled (using (?s) at the start of the regular expression), and then prints the modified string (end="" prevents print from adding an extra newline).
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
Sed does match all characters (including the \n) using a dot . but usually it has already stripped the \n off, as part of the cycle, so it is no longer present in the pattern space to be matched.
Only certain commands (N,H and G) preserve newlines in the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
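As a small illustration of N and G, here is a sketch with two hypothetical helper functions, join_pairs and double_space. The sample input is made up:

```shell
# join_pairs: N appends the next input line to the pattern space, separated
# by a newline, so the substitution can see that newline and replace it.
join_pairs() {
    sed 'N;s/\n/ /'
}

# double_space: G appends a newline plus the (empty) hold space to each line.
double_space() {
    sed G
}

printf 'a\nb\nc\nd\n' | join_pairs
printf 'a\nb\n' | double_space
```

join_pairs turns the four input lines into "a b" and "c d"; double_space prints each input line followed by an empty line.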
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed operates in a line-by-line manner, so you cannot directly tell it to make dot match a newline. However, there are some tricks that achieve this. You can use a loop structure (of sorts) to put all the text in the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
$ echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
Note that in this case, (\n|.) matches newline and all characters. See below example:
$ echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
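Writing a label and further commands on one line separated by semicolons is a GNU convenience. A sketch of the same slurp loop with each command in its own -e expression, which other sed implementations tend to accept as well; slurp_join is a hypothetical helper name and joining with commas is just an example operation:

```shell
# slurp_join: collect all input lines in the pattern space, then replace
# every embedded newline with a comma.
#   :a    define label "a"
#   N     append the next input line (with a newline) to the pattern space
#   $!ba  if this is not the last line, branch back to "a"
slurp_join() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g'
}

printf 'one\ntwo\nthree\n' | slurp_join
```

For the three sample lines this prints one,two,three.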

Excluding only four-digit and five-digit numbers of a txt file

I have a file in linux bash, where I have a list of file names basically.
The filenames include all kinds of characters: A-Z, a-z, ., _ and numbers.
Examples:
hello-34.87-world
foo-34578-bar
fo.23-5789-foobar
and a lot more...
The goal is to get a list of only the four- and five-digit numbers.
So the outcome should be:
34578
5789
I thought it would be a good idea to work with vi.
So I could use only one command like:
:%s/someregularexpression//g
Thanks for your help.
Without having to store the filenames in a file
$ shopt -s extglob nullglob
$ for file in *-[0-9][0-9][0-9][0-9]?([0-9])-*; do
    tmp=${file#*-} # remove up to the first hyphen
    tmp=${tmp%-*} # remove the last hyphen and after
    echo "$tmp"
done
5789
34578
Just use sed:
sed -nr 's/^(.*[^0-9])?([0-9]{4,5})([^0-9].*)?$/\2/p' < myfile.txt
If you use vim, and a line doesn't have more than one such number per line, you can try the following:
:%s/^.\{-}\(\d\{4,5\}\).\{-}$/\1/g
And see :help \{- for non-greedy search.
This works with 1 instance per line and surrounded by 1 pair of dashes
grep -P '\d{4,5}' mytxt | \
while read buff
do
buff=${buff#*-}
echo ${buff%-*}
done
Normally you do not want to parse ls in view of spaces, newlines and other special characters. In this case you don't care.
First replace everything that is not a digit with a newline (tr takes a plain character set, not a regular expression, so the set is written 0-9 rather than [^0-9]).
Then keep only the lines with 4 or 5 digits. After the replacement each line holds only digits, so it is enough to look for lines that are 4 or 5 characters long.
ls | tr -c '0-9' '\n' | grep -E '^.{4,5}$'
When you already have the filenames in a file and you are in vi, use
:% !tr -c '0-9' '\n' | grep -E '^.{4,5}$'
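The split-then-filter idea can be wrapped in a small function and tried on the sample names from the question. This is a sketch only; four_or_five_digits is a hypothetical helper name:

```shell
# four_or_five_digits: read names on stdin and print every run of exactly
# four or five digits, one per line.
four_or_five_digits() {
    tr -c '0-9' '\n' |        # every non-digit becomes a line break
    grep -E '^[0-9]{4,5}$'    # keep lines that are exactly 4 or 5 digits
}

printf '%s\n' hello-34.87-world foo-34578-bar fo.23-5789-foobar |
    four_or_five_digits
```

For these three names it prints 34578 and 5789. Shorter runs such as 34, 87 and 23 are dropped, and so is any run of six or more digits.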

count number of tab characters in linux

I want to count the numbers of hard tab characters in my documents in unix shell.
How can I do it?
I tried something like
grep -c \t foo
but it gives counts of t in file foo.
Use tr to discard everything except tabs, and then count:
< input-file tr -dc \\t | wc -c
Bash uses a $'...' notation for specifying special characters (note that grep -c counts the lines that contain a tab, not the individual tabs):
grep -c $'\t' foo
Use a perl regex (-P option) to grep tab characters.
So, to count the number of tab characters in a file:
grep -o -P '\t' foo | wc -l
You can insert a literal TAB character between the quotes with Ctrl+V+TAB.
In general you can insert any character at all by prefixing it with Ctrl+V; even control characters such as Enter or Ctrl+C that the shell would otherwise interpret.
You can use awk in a tricky way: use tab as the record separator, then the number of tab characters is the total number of records minus 1:
ntabs=$(awk 'BEGIN {RS="\t"} END {print NR-1}' foo)
My first thought was to use sed to strip out all non-tab characters, then use wc to count the number of characters left.
< foo.txt sed 's/[^\t]//g' | wc -c
However, this also counts newlines, which sed won't touch because it is line-based. So, let's use tr to translate all the newlines into spaces, so it is one line for sed.
< foo.txt tr '\n' ' ' | sed 's/[^\t]//g' | wc -c
Depending on your shell and implementation of sed, you may have to use a literal tab instead of \t, however, with Bash and GNU sed, the above works.
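To make the difference between counting tabs and counting lines that contain tabs concrete, here is a sketch with two hypothetical helpers, count_tabs and count_tab_lines. The sample input is made up:

```shell
# count_tabs: number of tab characters on stdin.
count_tabs() {
    tr -cd '\t' | wc -c    # delete everything except tabs, count what is left
}

# count_tab_lines: number of LINES containing at least one tab,
# which is what grep -c reports.
count_tab_lines() {
    grep -c "$(printf '\t')"   # printf supplies a literal tab as the pattern
}

printf 'a\tb\t\tc\nd\te\nno tabs here\n' | count_tabs
printf 'a\tb\t\tc\nd\te\nno tabs here\n' | count_tab_lines
```

The sample has four tabs spread over two lines, so the two functions report 4 and 2 respectively.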
