How to delete lowercase letters after the second upper case found in the line? - bash

I have a file with names:
Smith, John.
Brown, Aaron K.
And want to get:
Smith, J
Brown, A K
or better:
SmithJ
BrownAK
Can this task be solved in bash?

You can solve it with different tools and different methods. I will show two solutions using sed and one without.
Solution 1
You want to use some command on part of the line.
You can remove all non-uppercase characters from a string with echo "${string}" | tr -cd "[:upper:]".
With GNU sed's s/../../e, the line resulting from the substitution is passed to the shell for execution.
Combining these gives you:
sed -r 's/([^,]*)(.*)/echo "\1\$(echo "\2" | tr -cd "[:upper:]")"/e' file
Solution 2
Less creative but easier to write: temporarily split each line into two lines, run the substitution on the even lines only, then join the lines again and you're finished.
sed -e 's/,/\n/' file | sed '0~2s/[^A-Z]//g' | paste -d '' - -
Solution 3
With the tr from the first and the paste from the second solution you can avoid sed.
Be aware that the tr character set must include a newline.
paste -d '' <(cut -d, -f1 file) <(cut -d, -f2 file | tr -cd 'A-Z\n')
IMHO the second solution looks best. The first one is slow on large files.
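For comparison, the same result can be had with bash parameter expansion alone. This is only a minimal sketch, assuming bash and one "Surname, Given names" pair per line; compact_name is a made-up helper name, not something from the solutions above.

```shell
#!/usr/bin/env bash
# compact_name is a hypothetical helper name chosen for this sketch.
compact_name() {
    local line=$1
    # No comma: nothing to abbreviate, so print the line unchanged.
    if [[ $line != *,* ]]; then
        printf '%s\n' "$line"
        return
    fi
    local surname=${line%%,*}   # text before the first comma
    local rest=${line#*,}       # text after the first comma
    # Delete every character of the rest that is not an upper-case letter.
    printf '%s%s\n' "$surname" "${rest//[^[:upper:]]/}"
}

while IFS= read -r line; do
    compact_name "$line"
done <<'EOF'
Smith, John.
Brown, Aaron K.
EOF
```

For the two sample names this prints SmithJ and BrownAK. No process is started per line, so it should not suffer from the slowness of the s///e approach.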

Related

Shell: Counting lines per column while ignoring empty ones

I am trying to simply count the lines in the .CSV per column, while at the same time ignoring empty lines.
I use below and it works for the 1st column:
cat /path/test.csv | cut -d, -f1 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 8
And below for the 2nd column:
cat /path/test.csv | cut -d, -f2 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 6
But when I try to count the 3rd column, it simply outputs the total number of lines in the whole .CSV.
cat /path/test.csv | cut -d, -f3 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 33
#Should be: 19?
I've also tried to use awk instead of cut, but get the same issue.
I have tried creating new file thinking maybe it had some spaces in the lines, still the same.
Can someone clarify what the difference is between reading columns 1-2 and reading the rest?
20355570_01.tif,,
20355570_02.tif,,
21377804_01.tif,,
21377804_02.tif,,
21404518_01.tif,,
21404518_02.tif,,
21404521_01.tif,,
21404521_02.tif,,
,22043764_01.tif,
,22043764_02.tif,
,22095060_01.tif,
,22095060_02.tif,
,23507574_01.tif,
,23507574_02.tif,
,,23507574_03.tif
,,23507804_01.tif
,,23507804_02.tif
,,23507804_03.tif
,,23509247_01.tif
,,23509247_02.tif
,,23509247_03.tif
,,23527663_01.tif
,,23527663_02.tif
,,23527663_03.tif
,,23527908_01.tif
,,23527908_02.tif
,,23527908_03.tif
,,23535506_01.tif
,,23535506_02.tif
,,23535562_01.tif
,,23535562_02.tif
,,23535636_01.tif
,,23535636_02.tif
That happens when the input file has DOS line endings (\r\n). Fix your file using dos2unix and your command will work for the 3rd column too.
dos2unix /path/test.csv
Or, you can remove the \r at the end while counting non-empty columns using awk:
awk -F, '{sub(/\r/,"")} $3!=""{n++} END{print n}' /path/test.csv
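To see the effect of the carriage return in isolation, here is a small sketch that builds a CRLF file on the fly and counts the third column with and without stripping \r. The helper name count_nonempty and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# count_nonempty COLUMN FILE: hypothetical helper that counts the non-empty
# fields of one comma-separated column after dropping carriage returns.
count_nonempty() {
    cut -d, -f"$1" "$2" | tr -d '\r' | grep -c .
}

sample=$(mktemp)
# Four made-up rows with DOS line endings; only two have a third field.
printf 'a.tif,,\r\n,b.tif,\r\n,,c.tif\r\n,,d.tif\r\n' > "$sample"

cut -d, -f3 "$sample" | grep -c .   # counts all 4 rows: the lone \r matches "."
count_nonempty 3 "$sample"          # counts 2
rm -f "$sample"
```

The last field of every row ends in \r, so it is never empty as far as grep is concerned; columns 1 and 2 are unaffected because the \r only sits at the end of the line.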
The problem is in the grep command: the way you wrote it will return 33 lines when you count the 3rd column.
Instead, it is better to use the following command to count the non-blank lines of each column (the example below is for the 3rd column):
cat /path/test.csv | cut -d , -f3 | grep -cve '^\s*$'
This returns the exact number of lines for each column and avoids piping into wc.
See previous post here:
count (non-blank) lines-of-code in bash
edit: I think oguz ismail found the actual reason in their answer. If they are right and your file has windows line endings you can use one of the following commands without having to convert the file.
cut -d, -f3 yourFile.csv | tr -d \\r | grep -c .
cut -d, -f3 yourFile.csv | grep -c $'[^\r]' # bash only
old answer: Since I cannot reproduce your problem with the provided input I take a wild guess:
The "empty" fields in the last column contain spaces. A field containing a space is not empty, although it looks empty because you cannot see spaces.
To count only fields that contain something other than a space adapt your regex from . (any symbol) to [^ ] (any symbol other than space).
cut -d, -f3 yourFile.csv | grep -c '[^ ]'
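Before picking a fix, it can help to check which of the two explanations (carriage returns or stray spaces) applies to the file. A minimal sketch, assuming bash for the $'\r' quoting; has_cr is a hypothetical helper name and the sample files are made up.

```shell
#!/usr/bin/env bash
# has_cr FILE: hypothetical helper; succeeds if any line of FILE contains a carriage return.
has_cr() {
    grep -q $'\r' "$1"
}

dos=$(mktemp)
unix=$(mktemp)
printf ',,x.tif\r\n' > "$dos"    # made-up row with a DOS line ending
printf ',,x.tif\n'   > "$unix"   # same row with a Unix line ending

if has_cr "$dos";  then echo "dos sample: carriage returns found"; fi
if has_cr "$unix"; then echo "unix sample: carriage returns found"; else echo "unix sample: clean"; fi
rm -f "$dos" "$unix"
```

If the check succeeds on your file, the line-ending explanation is the likely one; if it fails, look for spaces instead.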

Deleting the nth character of a string in UNIX [duplicate]

I need to cut letter X out of a word:
For example: I need to cut the first letter out of Star Wars, the fourth out of munich,...
1 star wars
4 munich
5 casino royale
7 the fast and the furious
52 a fish called wanda
to
tar wars
munch
casio royale
the fat and the furious
a fish called wanda
I already tried it with cut, but it didn't work.
This was my command:
sed 's/^\([0-9]*\) \(.*\)/ echo \2 | cut -c \1/'
So it gave me this output:
echo star wars | cut -c 5
echo munich | cut -c 5
echo casino royale | cut -c 5
echo the fast and the furious | cut -c 5
echo a fish called wanda | cut -c 52
And then, if I send it to bash, I only get the Xth letter of the word.
I need to do the exercise with sed and other commands. But I can't use awk or perl.
Thanks
You can use just bash and its parameter expansion:
while read n s ; do
echo "${s:0:n-1}${s:n}"
done < input.txt
If you need one line, just remove the newlines and add a semicolon:
while read n s ; do echo "${s:0:n-1}${s:n}" ; done < input.txt
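The expansion ${s:0:n-1}${s:n} keeps the first n-1 characters, skips one, and keeps everything after it. Here is the same idea as a function, as a minimal sketch: delete_nth is a hypothetical name, and the guard for n < 1 is an addition that the loop above does not have.

```shell
#!/usr/bin/env bash
# delete_nth N STRING: hypothetical helper that prints STRING without its Nth character.
delete_nth() {
    local n=$1 s=$2
    # Guard (an assumption, not part of the loop above): leave the string alone for N < 1.
    if (( n < 1 )); then
        printf '%s\n' "$s"
        return
    fi
    # First n-1 characters, then everything after position n.
    # If N exceeds the length, the first piece is already the whole string.
    printf '%s\n' "${s:0:n-1}${s:n}"
}

while read -r n s; do
    delete_nth "$n" "$s"
done <<'EOF'
1 star wars
4 munich
5 casino royale
7 the fast and the furious
52 a fish called wanda
EOF
```

The last line comes out unchanged because the title is shorter than 52 characters.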
If you really need to use sed and cut, it's also doable, but a bit less readable:
cat -n input.txt \
| sed 's/\t\([0-9]\+\).*/s=\\(.\\{\1\\}\\).=\\1=/' \
| sed -f- <(sed 's/[0-9]*//' input.txt) \
| cut -c2-
Explanation:
number the lines
turn each line into a sed command that searches for the given number of characters and removes the one following them
run the generated sed command on the original file with the numbers removed
remove the extra leading space
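To make the second step concrete, this sketch prints the sed script that the first two stages of the pipeline generate for two of the sample lines. It assumes GNU sed (for \t and \+); gen_script is a hypothetical name used only here.

```shell
#!/usr/bin/env bash
# gen_script: hypothetical wrapper around the first two pipeline stages.
# Reads "N text" lines on stdin and prints one generated sed command per line.
gen_script() {
    cat -n | sed 's/\t\([0-9]\+\).*/s=\\(.\\{\1\\}\\).=\\1=/'
}

printf '1 star wars\n4 munich\n' | gen_script
# Each output line has the form  <line number>s=\(.\{N\}\).=\1=
# meaning: on that input line, capture N characters plus one more and keep only the N.
```

The N captured characters are the leading space (left over once the number is removed) plus the first N-1 characters of the title, so the character that gets dropped is the Nth one.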
This might work for you (GNU sed):
sed 's/^\([0-9]*\) \(.*\)/echo '\''\2'\''|sed '\''s\/.\/\/\1'\''/e' file
This uses the e flag of the s command to evaluate the RHS and runs a second sed invocation using the backreferences from the LHS. Perhaps easier on the eye is this:
sed -r 's/^([0-9]*) (.*)/echo "\2"|sed "s#.##\1"/e' file
You can use sed in this way:
sed -e '1s/\([a-z]\)\{1\}//' -e 's/^[0-9]\+\s\+\(.*\)/\1/g' file.txt
The first sed expression works on the first line only and removes the first letter. The second expression handles the rest of the text: the first column is one or more digits, the second is one or more whitespace characters, and the remaining text is captured in one group \(.*\) that replaces the whole line.

How to remove the last character from a bash grep output

COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2`
outputs something like this
"Abc Inc";
I want to remove the trailing ";" as well. How can I do that? I am a beginner to bash. Any thoughts or suggestions would be helpful.
This will remove the last character of your COMPANY_NAME variable, whether or not it is a semicolon:
echo "$COMPANY_NAME" | rev | cut -c 2- | rev
I'd use sed 's/;$//'. eg:
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | sed 's/;$//'`
foo="hello world"
echo ${foo%?}
hello worl
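I'd use head --bytes -2, or head -c-2 for short (GNU head). The count is 2 because the output of cut ends with a newline, which is the last byte; the semicolon is the byte before it.
COMPANY_NAME=`cat file.txt | grep "company_name" | cut -d '=' -f 2 | head --bytes -2`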
head outputs only the beginning of a stream or file. Typically it counts lines, but it can be made to count characters/bytes instead. head --bytes 10 will output the first ten characters, but head --bytes -10 will output everything except the last ten.
NB: you may have issues if the final character is multi-byte, but a semi-colon isn't
I'd recommend this solution over sed or cut because
It's exactly what head was designed to do, thus fewer command-line options and an easier-to-read command
It saves you having to think about regular expressions, which are cool/powerful but often overkill
It saves your machine having to think about regular expressions, so will be imperceptibly faster
I believe the cleanest way to strip a single character from a string with bash is:
echo ${COMPANY_NAME:: -1}
but I haven't been able to embed the grep piece within the curly braces, so your particular task becomes a two-liner:
COMPANY_NAME=$(grep "company_name" file.txt); COMPANY_NAME=${COMPANY_NAME:: -1}
This will strip any character, semicolon or not, but can get rid of the semicolon specifically, too.
To remove ALL semicolons, wherever they may fall:
echo ${COMPANY_NAME//;/}
To remove only a semicolon at the end:
echo ${COMPANY_NAME%;}
Or, to remove multiple semicolons from the end (this needs shopt -s extglob, because a plain %%; still removes only one):
echo ${COMPANY_NAME%%+(;)}
For great detail and more on this approach, The Linux Documentation Project covers a lot of ground at http://tldp.org/LDP/abs/html/string-manipulation.html
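The semicolon variants are easy to mix up, so here they are side by side as a minimal sketch. The three function names and the sample value are made up for illustration; the +(;) pattern assumes bash with extglob enabled.

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables the +(...) pattern used below

# Hypothetical helpers, one per variant.
strip_one_semicolon()   { printf '%s\n' "${1%;}"; }       # at most one trailing ;
strip_all_trailing()    { printf '%s\n' "${1%%+(;)}"; }   # every trailing ;
strip_every_semicolon() { printf '%s\n' "${1//;/}"; }     # every ; anywhere

value='"Abc; Inc";;'   # made-up sample with one inner and two trailing semicolons
strip_one_semicolon   "$value"   # "Abc; Inc";
strip_all_trailing    "$value"   # "Abc; Inc"
strip_every_semicolon "$value"   # "Abc Inc"
```

Note that shopt -s extglob has to run before the line using +(;) is parsed, which is why it sits at the top of the script.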
Using sed, if you don't know what the last character actually is:
$ grep company_name file.txt | cut -d '=' -f2 | sed 's/.$//'
"Abc Inc"
Don't abuse cats. Did you know that grep can read files, too?
The canonical approach would be this:
grep "company_name" file.txt | cut -d '=' -f 2 | sed -e 's/;$//'
The smarter approach would use a single perl or awk statement, which can filter and transform in one go. For example, something like this:
COMPANY_NAME=$( perl -ne '/company_name=(.*);/ && print $1' file.txt )
You don't have to chain so many tools. A single awk command does the job:
COMPANY_NAME=$(awk -F"=" '/company_name/{gsub(/;$/,"",$2) ;print $2}' file.txt)
In Bash using only one external utility:
IFS='= ' read -r discard COMPANY_NAME <<< $(grep "company_name" file.txt)
COMPANY_NAME=${COMPANY_NAME/%?}
Assuming the quotation marks are actually part of the output, couldn't you just use the -o switch to return everything between the quote marks (quotes included)?
COMPANY_NAME='"ABC Inc";'; echo "$COMPANY_NAME" | grep -o '".*"'
you can strip the beginnings and ends of a string by N characters using this bash construct, as someone said already
$ fred=abcdefg.rpm
$ echo ${fred:1:-4}
bcdefg
However, this is not supported in older versions of bash (a negative length needs bash 4.2 or later), as I discovered just now while writing a script for a Red Hat EL6 install process. This is the sole reason for posting here.
A hacky way to achieve this is to use sed with extended regex like this:
$ fred=abcdefg.rpm
$ echo $fred | sed -re 's/^.(.*)....$/\1/g'
bcdefg
Some refinements to answer above. To remove more than one char you add multiple question marks. For example, to remove last two chars from variable $SRC_IP_MSG, you can use:
SRC_IP_MSG=${SRC_IP_MSG%??}
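Combining suffix removal with prefix removal reproduces the ${fred:1:-4} example without negative lengths and without sed. A minimal sketch using only POSIX expansions; trim_ends is a hypothetical name, and the fixed counts (one leading, four trailing characters) are taken from the fred example.

```shell
#!/usr/bin/env bash
# trim_ends STRING: hypothetical helper that drops the first character
# and the last four characters of STRING.
trim_ends() {
    local s=${1#?}             # one ? removes one leading character
    printf '%s\n' "${s%????}"  # four ? remove four trailing characters
}

fred=abcdefg.rpm
trim_ends "$fred"   # bcdefg
```

Because # and % with ? patterns are plain POSIX parameter expansion, this should behave the same on old and new bash versions.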
cat file.txt | grep "company_name" | cut -d '=' -f 2 | cut -d ';' -f 1
I am not finding that sed 's/;$//' works. It doesn't trim anything, though I'm wondering whether it's because the character I'm trying to trim off happens to be a "$". What does work for me is sed 's/.\{1\}$//'.

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need to get such output.txt
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to @Jonathan's comment to @eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used to be very good, but when it comes to file parsing and regex you can't beat perl
(Not sure why people think that sed and tr are closer to shell than perl... )
perl -pe 's/\n/,/' your_file
if you want pure shell to do it then look at string matching
${string/#substring/replacement}
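One way a pure-shell version could look is sketched below. It does not use the ${string/#substring/replacement} form; instead it reads the file into an array and joins it on a comma through IFS. It assumes bash 4 or later (for mapfile); join_lines is a hypothetical name and the CRLF sample is made up.

```shell
#!/usr/bin/env bash
# join_lines FILE: hypothetical helper that prints the lines of FILE joined by commas.
join_lines() {
    local -a lines
    local cr=$'\r'
    mapfile -t lines < "$1"        # one array element per line (bash 4+)
    lines=("${lines[@]%"$cr"}")    # strip a trailing CR from each element, if any
    local IFS=,                    # "${lines[*]}" joins on the first character of IFS
    printf '%s\n' "${lines[*]}"
}

input=$(mktemp)
printf '1\r\n2\r\n3\r\n4\r\n5\r\n' > "$input"   # made-up sample with CRLF endings
join_lines "$input"                              # 1,2,3,4,5
rm -f "$input"
```

Joining through IFS puts the comma only between elements, so there is no trailing comma to clean up afterwards.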
Use the paste command. Here it is with a pipe (printf is used because bash's echo does not interpret \n by default):
printf '1\n2\n3\n4\n5\n' | paste -s -d, -
Here it is with a file:
printf '1\n2\n3\n4\n5\n' > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per the man page, -s concatenates all lines into one and -d sets the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5 and not 1,2,3,4,5, (note the comma after the 5; most of the solutions above also leave that trailing comma), here are two more Awk versions (with wc and sed) that get rid of the last comma:
i='input.txt'; awk -v c="$(wc -l < "$i")" '{printf("%s",$0);if(NR<c){printf(",")}}' "$i"
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
cat input.txt | sed -e 's|$|,|' | xargs -i echo "{}"
