Extracting only four-digit and five-digit numbers from a txt file - bash

I have a file in linux bash that basically contains a list of file names.
The filenames include all kinds of characters: A-Z a-z . _ and numbers.
Examples:
hello-34.87-world
foo-34578-bar
fo.23-5789-foobar
and a lot more...
The goal is to get a list of only the four- and five-digit numbers.
So the outcome should be:
34578
5789
I thought it would be a good idea to work with vi.
So I could use just one command like:
:%s/someregularexpression//g
Thanks for your help.

Without having to store the filenames in a file
$ shopt -s extglob nullglob
$ for file in *-[0-9][0-9][0-9][0-9]?([0-9])-*; do
    tmp=${file#*-}   # remove up to and including the first hyphen
    tmp=${tmp%-*}    # remove the last hyphen and everything after it
    echo "$tmp"
done
5789
34578

Just use sed:
sed -nr 's/^(.*[^0-9])?([0-9]{4,5})([^0-9].*)?$/\2/p' < myfile.txt
The optional non-digit context on both sides ensures that only complete 4- or 5-digit runs are captured, even when a shorter number appears earlier in the name (as in fo.23-5789-foobar).

If you use vim, and a line doesn't contain more than one such number, you can try the following:
:%s/^.\{-}\(\d\{4,5\}\).\{-}$/\1/g
See :help \{- for non-greedy matching.
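A rough check from the shell (assuming ex is provided by Vim, so \d and \{-} are understood; names.txt is a hypothetical sample file):
$ printf 'foo-34578-bar\nfo.23-5789-foobar\n' > names.txt
$ ex -s names.txt <<'EOF'
%s/^.\{-}\(\d\{4,5\}\).\{-}$/\1/g
%print
q!
EOF
34578
5789
Note that lines without such a number are left untouched, so filter those out beforehand if needed.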

This works with one match per line, surrounded by a single pair of dashes:
grep -P '\d{4,5}' mytxt | \
while read -r buff
do
    buff=${buff#*-}     # strip up to the first hyphen
    echo "${buff%-*}"   # strip the last hyphen and after
done

Normally you do not want to parse the output of ls because of spaces, newlines and other special characters. In this case it doesn't matter.
First replace every non-digit character with a newline.
Then only keep the lines with 4 or 5 digits. After the replacement the lines consist only of digits, so this can be done by looking for lines of 4 or 5 characters.
ls | tr -c '0-9' '\n' | grep -E '^.{4,5}$'
When you already have the filenames in a file and you are in vi, use
:% !tr -c '0-9' '\n' | grep -E '^.{4,5}$'

Related

Bash - Read in a file and replace multiple spaces with just one comma

I'm trying to write a bash script that will take in a file with spaces and output the same file, but comma delimited. I figured out how to replace spaces with commas, but I've run into a problem: there are some rows that have a variable number of spaces. Some rows contain 2 or 3 spaces and some contain as many as 7 or 13. Here's what I have so far:
sed 's/ /,/g' $varfile > testdone.txt
$varfile is the file name that the user gives.
But I'm not sure how to fix the variable space problem. Any suggestions are welcome. Thank you.
This is not a job for sed. tr is more appropriate:
$ printf 'foo bar\n' | tr -s ' ' ,
foo,bar
The -s tells tr to squeeze repeated occurrences into one. Also, you can generalize with tr -s '[:space:]' , (which will replace newlines too, perhaps undesirable) or tr -s ' \t' , to handle spaces or tabs.
You just need to use the + quantifier to match one or more spaces.
Assuming GNU sed
sed 's/ \+/,/g' file
# or
sed -E 's/ +/,/g' file
With GNU basic regular expressions, the "one or more" quantifier is \+
With GNU extended regular expressions, the "one or more" quantifier is +
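For example:
$ printf 'foo  bar   baz\n' | sed -E 's/ +/,/g'
foo,bar,baz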

Text processing in bash - extracting information between multiple HTML tags and outputting it into CSV format [duplicate]

I can't figure out how to tell sed to let the dot match a newline:
echo -e "one\ntwo\nthree" | sed 's/one.*two/one/m'
I expect to get:
one
three
instead I get original:
one
two
three
sed is a line-based tool. I don't think there is an option for this.
You can use h/H (hold), g/G (get):
$ echo -e 'one\ntwo\nthree' | sed -n '1h;1!H;${g;s/one.*two/one/p}'
one
three
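Reading that script piece by piece:
1h                  # line 1: copy the pattern space into the hold space
1!H                 # every other line: append it to the hold space
$                   # on the last line:
g                   #   copy the hold space back into the pattern space
                    #   (the whole input, newlines included)
s/one.*two/one/p    #   substitute across the newlines and print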
Maybe you should try vim
:%s/one\_.*two/one/g
If you use GNU sed, you can match any character, including line break chars, with a mere ., as the manual says:
.
Matches any character, including newline.
All you need to add is the -z option, which makes sed read NUL-separated records, so the newlines stay in the pattern space:
echo -e "one\ntwo\nthree" | sed -z 's/one.*two/one/'
# => one
# three
However, one.*two might not be what you need since * is always greedy in POSIX regex patterns. So, one.*two will match the leftmost one, then any 0 or more chars as many as possible, and then the rightmost two. If you need to remove one, then any 0+ chars as few as possible, and then the leftmost two, you will have to use perl:
perl -i -0 -pe 's/one.*?two//sg' file # Non-Unicode version
perl -i -CSD -Mutf8 -0 -pe 's/one.*?two//sg' file # S&R in a UTF8 file
The -0 option sets the input record separator to NUL, so a text file containing no NUL bytes is read as a whole rather than line by line (-0777 is the canonical "slurp" switch), -i enables inline file modification, s makes . match any char including line break chars, and .*? matches any 0 or more chars as few as possible due to the non-greedy *?. The -CSD -Mutf8 part makes sure your input is decoded and the output re-encoded back correctly.
You can use python this way:
$ echo -e "one\ntwo\nthree" | python -c 'import re, sys; s=sys.stdin.read(); s=re.sub("(?s)one.*two", "one", s); print s,'
one
three
$
This reads the entire standard input (sys.stdin.read()), then substitutes "one" for "one.*two" with the dot-matches-all setting enabled (the (?s) at the start of the regular expression), and then prints the modified string (the trailing comma in print is used to prevent print from adding an extra newline; note this is Python 2 syntax).
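A Python 3 equivalent (same logic, just a sketch):
$ echo -e "one\ntwo\nthree" | python3 -c 'import re, sys; sys.stdout.write(re.sub("(?s)one.*two", "one", sys.stdin.read()))'
one
three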
This might work for you:
<<<$'one\ntwo\nthree' sed '/two/d'
or
<<<$'one\ntwo\nthree' sed '2d'
or
<<<$'one\ntwo\nthree' sed 'n;d'
or
<<<$'one\ntwo\nthree' sed 'N;N;s/two.//'
Sed does match all characters (including the \n) using a dot ., but usually the \n has already been stripped off as part of the cycle, so it is no longer present in the pattern space to be matched.
Only certain commands (N, H and G) carry newlines into the pattern/hold space.
N appends a newline to the pattern space and then appends the next line.
H does exactly the same except it acts on the hold space.
G appends a newline to the pattern space and then appends whatever is in the hold space too.
The hold space is empty until you place something in it so:
sed G file
will insert an empty line after each line.
sed 'G;G' file
will insert 2 empty lines etc etc.
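A quick illustration:
$ seq 3 | sed G
1

2

3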
How about two sed calls:
(get rid of the 'two' first, then get rid of the blank line)
$ echo -e 'one\ntwo\nthree' | sed 's/two//' | sed '/^$/d'
one
three
Actually, I prefer Perl for one-liners over Python:
$ echo -e 'one\ntwo\nthree' | perl -pe 's/two\n//'
one
three
The discussion below is based on GNU sed.
sed operates in a line-by-line manner, so it's not possible to simply tell it that dot should match newline. However, there is a trick that can implement this: you can use a loop structure (kind of) to put all the text in the pattern space, and then do the operation.
To put everything in the pattern space, use:
:a;N;$!ba;
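Piece by piece:
:a      # define a label named a
N       # append the next input line to the pattern space
$!ba    # if this is not the last line, branch back to label a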
To make "dot match newline" indirectly, you use:
(\n|.)
So the result is:
root@u1804:~# echo -e "one\ntwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#
Note that in this case, (\n|.) matches newline and all characters. See below example:
root@u1804:~# echo -e "oneXXXXXX\nXXXXXXtwo\nthree" | sed -r ':a;N;$!ba;s/one(\n|.)*two/one/'
one
three
root@u1804:~#

How to change characters in a file?

I have a file that contains many characters. I need to count how many times each character
appears in the file (the file contains more than one " " between words).
I figured that the best way to do so is using tr -s " " "\n"
and then using sort. That way I can easily use egrep -c to count the characters.
But how do I use the tr command properly?
I seem to be unable to use it and put its output into a variable.
The easiest implementation would probably be to add a \n after each char,
then to sort them and count them:
$ cat file
foo bar baz.
$ sed 's/./&\n/g' file | sort | uniq -c
1
2
1 .
2 a
2 b
1 f
2 o
1 r
1 z
You can probably do something like that with bash's associative arrays, but it would be tricky and you couldn't count \0 characters anyway.
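For completeness, a rough sketch of that associative-array approach (reading from a hypothetical file named file; as noted, bash variables cannot hold NUL bytes, so \0 still cannot be counted):
declare -A count
while IFS= read -r -n1 ch; do
    [[ -z $ch ]] && ch=$'\n'                 # read yields an empty string for the newline delimiter
    count[$ch]=$(( ${count[$ch]:-0} + 1 ))
done < file
for ch in "${!count[@]}"; do
    printf '%q %d\n' "$ch" "${count[$ch]}"   # %q makes whitespace keys visible
done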
Using sed in regular expression mode may help you, if I understood your problem correctly:
sed -r 's/(.)/\1\n/g' your_file.txt | sort | uniq -c
You tell sed to capture each single character in a regexp group (the (.) part) and substitute it with the group (\1) followed by \n, to get one character per line. Next, you can use sort and uniq -c to make that count for you. This will include non-printable characters; you can avoid counting non-printable characters by introducing some changes in the sed:
sed -r 's/[^[:graph:]]+//g; s/([[:graph:]])/\1\n/g' your_file.txt | sort | uniq -c
First delete the non-printable characters, then substitute each printable character with itself plus \n.

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string='line of input from CSV file'
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
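For instance, with a quoted field containing a comma (a made-up sample line), the naive count overshoots:
$ echo '"Smith, John",42' | awk -F "," '{print NF-1}'
2
Only one of those two commas is a real field separator.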
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
    echo "$(( ${#array[@]} - 1 ))"   # number of fields minus one
done < inputfile
or
while read -r line
do
    count=${line//[^,]}   # delete everything except commas
    echo "${#count}"      # length of what is left
done < inputfile
Try Perl, where @{[/,/g]} collects the comma matches into an anonymous array and 0+ forces scalar context, i.e. the count:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python 3 is installed on most modern systems) is:
python3 -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas, you'll get a set with just that number in it).
Just remove all of the carriage returns:
tr -d '\r' < old_file > new_file

Listing all words containing more than 1 capitalized letter

I want to search for all of the acronyms in a document so I can correct their formatting. I think I can assume that all acronyms are words containing at least 2 capital letters (e.g. "EU"); I've never seen a one-letter acronym or an acronym containing only 1 capital letter, though sometimes they contain a small "o" for "of" or another lowercase letter. How can I print a list showing each possible match once?
This might work for you:
tr -s '[:space:]' '\n' <input.txt | sed '/\<[[:upper:]]\{2,\}\>/!d' | sort -u
The -o option of grep can help you:
grep -o '\b[[:alpha:]]*[[:upper:]][[:alpha:]]*[[:upper:]][[:alpha:]]*'
Almost only Bash:
for word in $(cat file.txt) ; do
    if [[ $word =~ [[:upper:]].*[[:upper:]] ]] ; then # at least 2 capital letters
        echo "${word//[^[:alpha:]]/}" # remove non-alphabetic characters
    fi
done
Will this work for you?
sed 's/[[:space:]]\+/\n/g' $your_file | sort -u | egrep '[[:upper:]].*[[:upper:]]'
Translation:
Replace all runs of whitespace in $your_file with newlines. This will put each word on its own line.
Sort the file and remove duplicates.
Find all lines that contain two uppercase letters separated by zero or more characters.
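A quick run on a made-up sample, reading from a pipe instead of $your_file (GNU sed, for \+ and \n in the replacement):
$ printf 'the EU and NATO UK-based\n' | sed 's/[[:space:]]\+/\n/g' | sort -u | egrep '[[:upper:]].*[[:upper:]]'
EU
NATO
UK-based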
One way using perl.
Example:
Content of infile:
One T
Two T
THREE
Four
Five SIX
Running the perl command:
perl -ne 'printf qq[%s\n], $1 while /\b([[:upper:]]{2,})\b/g' infile
Result:
THREE
SIX
