Given a file foo.txt containing file names such as:
2015_275_14_1,Siboney_by_The_Tailor_Maids
2015_275_16_1,Louis_Armstrong_Cant_Give_You_Anything_But_Love
2015_275_17_1,Benny_Goodman_Trio_Nice_Work_Avalon
2015_275_18_1,Feather_On_Jazz_Jazz_In_The_Concert_Hall
2015_235_1_1,Integration_Report_1
2015_273_2_1_1,Cab_Calloway_Home_Movie_1
2015_273_2_2_1,Cab_Calloway_Home_Movie_2
I want to replace the _ in the part before the comma with . and the _ in the second part after the comma with a space.
I can accomplish each individually with:
sed -E -i '' 's/([0-9]{4})_([0-9]{3})_([0-9]{2})_([0-9])/\1.\2.\3.\4./'
for the first part, and the second part then with:
sed -E -i '' "s/_/ /g"
But I was hoping to accomplish it in an easier fashion by using cut with sed but that doesn't work:
cut -d "," -f 1 foo.txt | sed -E -i '' "s/_/./g" foo.txt && cut -d "," -f 2 foo.txt | sed -E -i '' "s/_/ /g" foo.txt
No good.
So, is there a way to accomplish this with sed or maybe awk or maybe something else where I'm treating the , as a delimiter such as in cut?
Desired output:
2015.275.14.1,Siboney by The Tailor Maids
You can use awk to attain your goal; here's the method:
$ awk -F',' '{gsub(/_/,".",$1);gsub(/_/," ",$2);printf "%s,%s\n",$1,$2}' file
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Similar to @CWLiu's answer, but I use OFS (output field separator) instead of adding the comma back in and having to add a newline when using printf.
awk -F ',' 'BEGIN {OFS = FS} {gsub(/_/, ".", $1); gsub(/_/, " ", $2); print;}' foo.txt
Explanation:
-F ',' sets the field separator
BEGIN {OFS = FS} sets the output field separator (default space) equal to the field separator so the comma is printed back out
gsub(/_/, ".", $1) global substitution on the first column
gsub(/_/, " ", $2) global substitution on the second column
print print the whole line
$ awk 'BEGIN{FS=OFS=","} {gsub(/_/,".",$1); gsub(/_/," ",$2)} 1' file
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Try this for GNU sed:
$ cat input.txt
2015_275_14_1,Siboney_by_The_Tailor_Maids
2015_275_16_1,Louis_Armstrong_Cant_Give_You_Anything_But_Love
2015_275_17_1,Benny_Goodman_Trio_Nice_Work_Avalon
2015_275_18_1,Feather_On_Jazz_Jazz_In_The_Concert_Hall
2015_235_1_1,Integration_Report_1
2015_273_2_1_1,Cab_Calloway_Home_Movie_1
2015_273_2_2_1,Cab_Calloway_Home_Movie_2
$ sed -r ':loop;/^[^_]+,/{s/_/ /g;bend};s/_/./;bloop;:end' input.txt
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
Explanation:
use s/_/./ to substitute _ with . one at a time until all the _ before the , have been substituted, which is detected by ^[^_]+,;
then, once ^[^_]+, matches, use s/_/ /g to substitute all the remaining _ (the ones after the ,) with spaces
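An equivalent way to write the same loop idea, as a sketch for GNU sed: keep turning any _ that still has a , somewhere after it into a ., and only then convert the leftover _ to spaces. It should produce the same output as above.
sed ':loop;s/_\([^,]*,\)/.\1/;tloop;s/_/ /g' input.txt
Here t branches back to :loop only while the substitution keeps succeeding, so the final s/_/ /g only ever sees the underscores after the comma.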
You could cut and paste:
$ paste -d, <(cut -d, -f1 infile | sed 'y/_/./') <(cut -d, -f2 infile | sed 'y/_/ /')
2015.275.14.1,Siboney by The Tailor Maids
2015.275.16.1,Louis Armstrong Cant Give You Anything But Love
2015.275.17.1,Benny Goodman Trio Nice Work Avalon
2015.275.18.1,Feather On Jazz Jazz In The Concert Hall
2015.235.1.1,Integration Report 1
2015.273.2.1.1,Cab Calloway Home Movie 1
2015.273.2.2.1,Cab Calloway Home Movie 2
The process substitution <() lets you treat the output of commands like a file, and paste -d, pastes the output of each command side-by-side, separated by a comma.
The sed y command transliterates characters and is, in this case, equivalent to s/_/./g and s/_/ /g, respectively.
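For instance, a tiny standalone check of y:
$ echo 'a_b_c' | sed 'y/_/./'
a.b.c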
You could also do it purely in sed, but it's a bit unwieldy:
sed 'h;s/.*,//;y/_/ /;x;s/,.*//;y/_/./;G;s/\n/,/' infile
Explained:
h # Copy pattern space to hold space
s/.*,// # Remove first part including comma
y/_/ / # Replace all "_" by spaces in the remaining second part
x # Swap pattern and hold space
s/,.*// # Remove second part including comma
y/_/./ # Replace all "_" by periods in the remaining first part
G # Append hold space to pattern space
s/\n/,/ # Replace linebreak with comma
Or, alternatively (from comment by potong):
sed 's/,/\n/;h;y/_/ /;x;y/_/./;G;s/\n.*\n/,/' infile
Explained:
s/,/\n/ # Replace comma by linebreak
h # Copy pattern space to hold space
y/_/ / # Replace all "_" by spaces
x # Swap pattern and hold space
y/_/./ # Replace all "_" by periods
G # Append hold space
s/\n.*\n/,/ # Remove second and third line in pattern space
so let's say i have
aaa
bbb
ccc
ddd
and i need to replace all new lines by comma and space, and end with a dot like so
aaa, bbb, ccc, ddd.
I found this, but i can't understand it at all and i need to know what every single character does ;
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Can someone make those two commands human-readable ?
tysm!
About the command:
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Awk prints each line $0 followed by ", ", without a newline. When this is done, you have a trailing ", " at the end.
Then the pipe to sed replaces that trailing ", " with a single dot, since the part .\{2\}$ matches any 2 characters at the end of the string.
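For example, running the whole pipeline on the sample lines should give:
$ printf 'aaa\nbbb\nccc\nddd\n' | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
aaa, bbb, ccc, ddd.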
With sed, using a single command, you can read all lines by using N to pull the next line into the pattern space, with a label to keep doing so as long as it is not the last line, and then replace every newline with ", ".
After that you can append a dot to the end.
sed ':a;N;$!ba;s/\n/, /g;s/$/./' file
Output
aaa, bbb, ccc, ddd.
ok,
first of all; thank u.
I do now understand that printf $0", " means 'print every line, and ", " at the end of each'
as for the sed command, a colleague explained it to me a minute ago;
in 's/.\{2\}$/./',
s/ replace
. any character
{2} x2, so two characters
$ at end of the line
/ by ( 's/ / /' = replace/ this / that /)
. the character '.'
/ end
without forgetting to escape { and }, so u end up with
's/ . \{2\} $ / . /'
but wait, it gets even better;
my colleague also told me that \{2\} wasn't necessary in this case ;
.{2} (without the escapes) could simply be replaced by
.. 'any character' twice.
so 's/..$/./' which is way more readable i think
'replace/ whichever two last characters / this character/'
hope this helps if any other 42 student gets here
tysm again
awk '{ printf $0", " }'
This is an awk command with a single action, enclosed in {...}; this action is applied to every line of input.
printf is print with format; here no formatting takes place, but another feature of printf is leveraged: unlike print, printf does not append the output record separator (default: newline).
$0 denotes the whole current line (sans trailing newline).
", " is a string literal for a comma followed by a space.
$0", " instructs awk to concatenate the current line with the comma and space.
The whole command might thus be read as: for every line, output the current line followed by a comma and a space.
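A small throwaway comparison makes the difference visible (any awk should behave the same way here):
$ printf 'aaa\nbbb\n' | awk '{ print $0", " }'
aaa, 
bbb, 
$ printf 'aaa\nbbb\n' | awk '{ printf $0", " }'
aaa, bbb, 
Note that printf treats its first argument as a format string; that is harmless for this data, but would misbehave if a line contained a % sign.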
sed 's/.\{2\}$/./'
s is one of sed's commands, namely substitute; its basic form is
s/regexp/replacement/
therefore
.\{2\}$ is a regular expression: . denotes any character, \{2\} means repeated 2 times, and $ denotes end of line; thus it matches the last 2 characters of each line. As the text has already been converted to a single line, it will match the last 2 characters of the whole text.
. is the replacement, a literal dot character
The whole command might thus be read as: for every line, replace the last 2 characters with a dot.
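And the sed part checked on its own:
$ echo 'aaa, bbb, ccc, ddd, ' | sed 's/.\{2\}$/./'
aaa, bbb, ccc, ddd.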
Assuming the four lines are in a file...
#!/bin/sh
# ed script: append a comma to every line, then change the comma
# on the last line into a period
cat << EOF > ed1
%s/$/,/g
$
s/,/./
wq
EOF
ed -s file < ed1
# join the lines into one, separated by spaces
tr '\n' ' ' < file > f2
mv f2 file
rm -v ./ed1
echo 'aaa
bbb
ccc
ddd' |
mawk NF+=RS FS='\412' RS= OFS='\40\454' ORS='\456\12'
aaa, bbb, ccc, ddd.
I have a text file:
$100 Birthday
$500 Laptop
$50 Phone
I created a --checklist from the text file
[ ] $100 Birthday
[*] $500 Laptop
[*] $50 Phone
the output is $100 $50
How can I delete the line of $100 and $50 in the text file, please?
The expected output of text file:
$100 Birthday
Thank you!
with grep and cut
grep -xf <(grep '\[ ]' file2.txt | cut -d\  -f3-) file1.txt
with grep and sed
grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
explanation
use grep to select lines from text file
$ grep Birthday file1.txt
100 Birthday
cut will split the line into columns. -f 2 will print only the 2nd column, but -f 2- will print everything from the 2nd column on. As delimiter -d, a whitespace ' ' is used here (some characters must be escaped with \)
and we can use a pipe | as input (instead of a file)
$ echo one two three | cut -d \  -f 2-
two three
$ grep Birthday file1.txt | cut -d \  -f 2-
Birthday                            ^
                                    |
(note the two whitespaces) ---------+
assuming we have a text file temp.txt
$ cat temp.txt
Birthday
Laptop
Phone
grep can also read a list of search patterns from another file as input instead
$ grep -f temp.txt file1.txt
100 Birthday
500 Laptop
50 Phone
or we can print the file content with cat and feed it in via process substitution <( )
$ grep -f <(cat temp.txt) file1.txt
100 Birthday
500 Laptop
50 Phone
Now let's generate temp.txt from the checklist. You only want to grep lines containing [ ] and cut starting from the 3rd column (again, some characters have special meaning and must therefore be escaped: \[)
$ grep '\[ ]' file2.txt
[ ] 100 Birthday
$ grep '\[ ]' file2.txt | cut -d\  -f3-
100 Birthday
You don't need temp.txt and can therefore feed the list straight to grep -f using what is called process substitution <(...)
$ grep -f <(grep '\[ ]' file2.txt | cut -d\  -f3-) file1.txt
100 Birthday
grep reads all lines from temp.txt as PATTERNs, and some characters have special meaning in a regex. ^ stands for beginning of line and $ for end of line. To be nitpickingly correct, the search pattern should therefore be '^100 Birthday$' so it won't match "1100 Birthday 2".
You might have noticed that I dropped the $ currency sign from your input files for a reason. You can keep it, but then tell grep to take the patterns literally with the -F flag and match whole lines with the -x flag, which will search for the whole line "100 Birthday" (no regex for line start/end needed)
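For example, if you keep the $ in both files, a sketch of the fixed-string, whole-line variant would be:
grep -Fxf <(grep '\[ ]' file2.txt | cut -d\  -f3-) file1.txt
which should print only the "$100 Birthday" line.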
sed [OPTION] 's/regexp/replacement/command' [file]
sed is more common when it comes to text editing. Instead of grep | cut we can do it with one single command:
grep '\[ ]' | cut -f3- and sed 's/\[ ] *//'
are basically targeting the same lines and deleting the [ ] prefix from them.
There are however some extra flags required, because sed is a text editor and will print the whole file by default. To emulate grep's behavior we use
the -n option to suppress automatic printing of input lines
the p flag to print only the lines where a substitution was made
and for regexp
\[ ] (text to replace)
' *' = ' ' (whitespace) + * (star)
meaning: the previous character repeated 0 or more times, here all the whitespace following [ ]
(the replacement is empty because we just want to delete)
so a similar working sed command will look like this
sed -n 's/\[ ] *//p' file2.txt
And that's, in my opinion, all it takes for a checklist. You have however two redundant files and want to match your cloned checklist against the original file, so let me show you some more complicated things.
Instead of deleting the checkbox, let's output captured groups. This pseudo-code will explain it better than I can: \1 refers to the first capture group ( ) and so on (they work like internal variables).
$ sed 's/(aaa)b(ccc)dd/\1/'
aaa
$ sed 's/(aaa)b(ccc)dd/\2/'
ccc
$ sed 's/(aaa)b(ccc)dd/\1 \2/'
aaa ccc
$ sed 's/(aaa)b(ccc)dd/lets \1 replace \2 this/'
lets aaa replace ccc this
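In real sed syntax the parentheses of a capture group need backslashes when you are using basic regular expressions (BRE); a quick runnable check of the last example:
$ echo 'aaabcccdd' | sed 's/\(aaa\)b\(ccc\)dd/lets \1 replace \2 this/'
lets aaa replace ccc this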
so in this example, sed 's/\[ ] \(.*\)/\1/', we use for the regexp
\[ ] (text to replace)
' ' (trailing whitespace)
and inside the first capture group ( ) the desired "100 Birthday"
.* = . (dot) + * (star)
meaning: the previous character repeated 0 or more times (here applied to the dot)
but the dot . itself is now regex for ANY character (special meaning)
so the capture group is all the rest of the line
and for replacement we use (only)
\1 first capture group
$ sed -n 's/\[ ] \(.*\)/\1/p' file2.txt
100 Birthday
But there is more :)
Instead of matching only a ' ' whitespace, there exists another regex token with special meaning (extended regex):
\s will match whitespace and tab
+ repeated previous character 1 or more times (note the difference to * 0 or more times)
\s+ will match a series of spaces
and to make it work we need one more flag
-r use extended regular expressions
so with this command you can extract all search patterns from your cloned checklist...
$ sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt
100 Birthday
...and finally let it run against your original file (without the need of temp.txt)
$ grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
100 Birthday
I am trying to grep for a word inside files which use # and -- as comment markers. The command that I used is
grep "^[^#]" -H -R -I "pathtofile" | grep "^[^--]" | grep -in ${1} | awk -F : ' { print $2 } ' | uniq)
which will print the file names containing the specific word. However, if there is a line like this
--test_specific_word_test test
The command above treats that line as something not to skip. The same applies where the comment is on the same line as code, like var=1 --comment.
Should I use sed to delete the comment lines first, or just use grep?
The downside is that I have a significant number of files to search, and GNU grep is version 2.0, which I can't upgrade because I don't have permission.
The command you've provided pipes through grep several times. You can skip commented lines with a single grep command:
grep -v "^ *\(--\|#\)" "pathtofile"
To print the filenames containing word1 use cut like so:
grep -Hv "^ *\(--\|#\)" filenames | grep "word1" | cut -d: -f1
To skip inline comments use sed:
sed "s/\(.*\)\(--\|#\).*/\1/g" inputfile
Sample input:
word1
word2
-word3 # inline comment
#comment1
--comment2
#comment3
output:
word1
word2
-word3
If in fact you are attempting to parse a programming language's source files, you are probably better off using a proper parser. Here is an attempt at refactoring your code into an Awk script, with several guesses as to what exactly the script should actually do.
find pathtofile -type f -exec awk -v word="$1" -F : '
# this doesn't reimplement grep -I though
{ sub("(#|--).*", "") } # remove comments
tolower($0) ~ tolower(word) && !($2 in a) { print FILENAME ":" FNR ":" $2; a[$2] }' {} +
This has the obvious flaw that if the programming language allows for # or -- in quoted strings and doesn't regard those as comments, the script will do the wrong thing.
There are no word boundaries in your script, so I didn't put any in mine either. This means that if word="dog" then it will print any string which contains the three adjacent letters d-o-g in this order, even in substring matches like "doggone" or "endogenous". If that's not what you want, you can add word boundary markers -- if you have GNU Awk, you can say BEGIN { word = "\\<" word "\\>" } at the beginning of the script.
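For instance, with GNU Awk the boundary version of the match might look like this standalone sketch (not your full find pipeline):
awk -v word="dog" 'BEGIN { word = "\\<" word "\\>" } tolower($0) ~ tolower(word)' file
This prints only the lines where dog appears as a whole word, case-insensitively.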
The technique to add the key to an array and only print the key if it wasn't already in the array is a common way to implement uniq. This will fail if find returns so many files that it will end up running more than one instance of awk -- this will be controlled by the value of ARG_MAX of your kernel.
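For reference, the same keep-only-the-first-occurrence idea in isolation, with a throwaway array name:
# print the second ':'-separated field of each line, but only the first time that value is seen
awk -F: '!seen[$2]++ { print $2 }' file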
I am trying to delete the 6th, 7th and 8th characters of each line.
Below is the file, in text format.
Actual content:
#cat test
18:40:12,172.16.70.217,UP
18:42:15,172.16.70.218,DOWN
Expected output after formatting:
#cat test
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
I also tried the below, with no luck:
#awk -F ":" '{print $1":"$2","$3}' test
18:40,12,172.16.70.217,UP
#sed 's/^\(.\{7\}\).\(.*\)/\1\2/' test { Here I can remove only one character }
18:40:1,172.16.70.217,UP
cut also failed:
#cut -d ":" -f1,2,3 test
18:40:12,172.16.70.217,UP
I need to delete the 6th, 7th and 8th characters in each line.
Suggestions, please.
With GNU cut you can use the --complement switch to remove characters 6 to 8:
cut --complement -c6-8 file
Otherwise, you can just select the rest of the characters yourself:
cut -c1-5,9- file
i.e. characters 1 to 5, then 9 to the end of each line.
With awk you could use substrings:
awk '{ print substr($0, 1, 5) substr($0, 9) }' file
Or you could write a regular expression, but the result will be more complex.
For example, to remove the last three characters from the first comma-separated field:
awk -F, -v OFS=, '{ sub(/...$/, "", $1) } 1' file
Or, using sed with a capture group:
sed -E 's/(.{5}).{3}/\1/' file
Capture the first 5 characters and use them in the replacement, dropping the next 3.
It's structured text, so why count the chars if you can describe them?
$ awk '{sub(":..,",",")}1' file
18:40,172.16.70.217,UP
18:42,172.16.70.218,DOWN
remove the seconds.
The solutions below are generic and assume no knowledge of any format. They just delete characters 6, 7 and 8 of any line.
sed:
sed 's/.//8;s/.//7;s/.//6' <file> # from high to low
sed 's/.//6;s/.//6;s/.//6' <file> # from low to high (subtract 1)
sed 's/\(.....\).../\1/' <file>
sed 's/\(.\{5\}\).../\1/' <file>
s/BRE/replacement/n :: substitute nth occurrence of BRE with replacement
awk:
awk 'BEGIN{OFS=FS=""}{$6=$7=$8="";print $0}' <file>
awk -F "" '{OFS=$6=$7=$8="";print}' <file>
awk -F "" '{OFS=$6=$7=$8=""}1' <file>
These are three variants of the same idea: with an empty field separator FS, awk treats every character as its own field. We empty fields 6, 7 and 8 and reprint the line with an output field separator OFS which is also empty.
cut:
cut -c -5,9- <file>
cut --complement -c 6-8 <file>
Just for fun, perl, where you can assign to a substring
perl -pe 'substr($_,5,3)=""' file
With awk :
echo "18:40:12,172.16.70.217,UP" | awk '{ $0 = ( substr($0,1,5) substr($0,9) ) ; print $0}'
Regards!
If you are running bash, you can use its string manipulation functionality instead of having to call awk, sed, cut or some other binary:
while read STRING
do
echo "${STRING:0:5}${STRING:8}"
done < myfile.txt
${STRING:0:5} represents the first five characters of your string, and ${STRING:8} represents the 9th character (offsets are zero-based) plus all remaining characters until the end of the line. This way you cut out characters 6, 7 and 8 ...
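A quick way to check the offsets at the prompt, with a throwaway variable:
$ s='18:40:12,172.16.70.217,UP'
$ echo "${s:0:5}${s:8}"
18:40,172.16.70.217,UP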
I am trying to parse a file with similar contents:
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
I want the output file to be tab-delimited:
I am a string\t12831928
I am another string\t41327318
A set of strings\t39842938
Another string\t3242342
I have tried the following:
sed 's/\s+/\t/g' filename > outfile
I have also tried cut, and awk.
Just use awk:
$ awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Breakdown:
-F'  +' # tell awk that input fields (FS) are separated by 2 or more blanks
-v OFS='\t' # tell awk that output fields are separated by tabs
'{sub(/ +$/,""); # remove all trailing blank spaces from the current record (line)
$1=$1} # recompile the current record (line) replacing FSs by OFSs
1' # idiomatic: any true condition invokes the default action of "print"
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
The difficulty comes from the varying number of words per line. While you can handle this with awk, a simple script reading each word in a line into an array and then tab-delimiting the last word in each line will work as well:
#!/bin/bash
fn="${1:-/dev/stdin}"
while read -r line || test -n "$line"; do
arr=( $(echo "$line") )
nword=${#arr[@]}
for ((i = 0; i < nword - 1; i++)); do
test "$i" -eq '0' && word="${arr[i]}" || word=" ${arr[i]}"
printf "%s" "$word"
done
printf "\t%s\n" "${arr[i]}"
done < "$fn"
Example Use/Output
(using your input file)
$ bash rfmttab.sh < dat/tabfile.txt
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Each number is tab-delimited from the rest of the string. Look it over and let me know if you have any questions.
sed -E 's/[ ][ ]+/\\t/g' filename > outfile
NOTE: the [ ] is openBracket Space closeBracket
-E for extended regular expression support.
The two bracket expressions [ ][ ]+ ensure that a tab is only substituted for runs of more than one consecutive space.
Tested on MacOS and Ubuntu versions of sed.
Your input has spaces at the end of each line, which makes things a little more difficult than without. This sed command would replace the spaces before that last column with a tab:
$ sed 's/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)$/\t\1/' infile | cat -A
I am a string^I12831928 $
I am another string^I41327318 $
A set of strings^I39842938 $
Another string^I3242342 $
This matches – anchored at the end of the line – blanks, non-blanks and again blanks, zero or more of each. The last column and the optional blanks after it are captured.
The blanks before the last column are then replaced by a single tab, and the rest stays the same – see output piped to cat -A to show explicit line endings and ^I for tab characters.
If there are no blanks at the end of each line, this simplifies to
sed 's/[[:blank:]]*\([^[:blank:]]*\)$/\t\1/' infile
Notice that some seds, notably BSD sed as found in MacOS, can't use \t for tab in a substitution. In that case, you have to use either $'\t' or "$(printf '\t')" instead.
Another approach, with GNU sed and rev:
$ rev file | sed -r 's/ +/\t/1' | rev
You have trailing spaces on each line. So you can do two sed expressions in one go like so:
$ sed -E -e 's/ +$//' -e $'s/ +/\t/' /tmp/file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Note the $'s/ +/\t/': This tells bash to replace \t with an actual tab character prior to invoking sed.
To show that these deletions and \t insertions are in the right place you can do:
$ sed -E -e 's/ +$/X/' -e $'s/ +/Y/' /tmp/file
I am a stringY12831928X
I am another stringY41327318X
A set of stringsY39842938X
Another stringY3242342X
Simple and without invisible semantic characters in the code:
perl -lpe 's/\s+$//; s/\s\s+/\t/' filename
Explanation:
Options:
-l: remove LF during processing (in this case)
-p: loop over records (like awk) and print
-e: code follows
Code:
remove trailing whitespace
change two or more whitespace to tab
Tested on OP data. The trailing spaces are removed for consistency.