Sed to remove substring | Can I make a flexible pattern to remove numbers before tab? - bash

I wanted to ask some advice on an issue that I'm having in removing a substring from a string. I have a file with many lines like the following:
DOG; CSQ| 0.1234 | abcd | \t CAT
where \t represents a literal tab.
My aim is to remove a substring by using sed 's/CSQ.*|//g' so that I can get the following output:
DOG; CAT
However I face a problem where all the rows aren't formatted the same. For example, I also get lines such as:
DOG; CSQ| 0.1234 | abcd | 0 \t CAT
DOG; CSQ| 0.1234 | abcd | 0.9187 \t CAT
My code fails at this point because instead of getting DOG; CAT for all lines, I get:
DOG; CAT
DOG; 0 CAT
DOG; 0.9187 CAT
I've searched for possible solutions but I'm having difficulty (I'm also quite new to bash). I imagine there's something that I can do with sed that will handle all cases, but I'm not sure.

You can remove everything from CSQ up to the last |, plus any characters after that up to and including the tab, using
sed 's/CSQ.*|.*\t//' file > newfile
The CSQ.*|.*\t pattern is a basic regular expression (the \t escape is a GNU sed extension) that matches
CSQ - a CSQ string
.* - any text
| - a pipe char
.* - any text
\t - TAB char.
If the \t in your file is a literal two-character sequence (a backslash followed by t), double the backslash before t:
sed 's/CSQ.*|.*\\t//' file > newfile
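A minimal sketch for trying this on sample lines, assuming the separator is a real tab. The function name strip_csq and the sample lines are illustrative only. Since \t inside a sed pattern is a GNU extension, the sketch places a literal tab in the pattern through a shell variable, and it uses "anything but a tab" instead of the second .* so the match stops at the first tab after the last pipe.

```shell
# Hypothetical helper: strip everything from "CSQ" up to and including
# the first tab that follows the last "|" on the line.
tab=$(printf '\t')          # a literal tab character

strip_csq() {
  sed "s/CSQ.*|[^${tab}]*${tab}//"
}

# Sample lines in the three shapes from the question (real tab before CAT).
printf 'DOG; CSQ| 0.1234 | abcd | \tCAT\n'        | strip_csq
printf 'DOG; CSQ| 0.1234 | abcd | 0 \tCAT\n'      | strip_csq
printf 'DOG; CSQ| 0.1234 | abcd | 0.9187 \tCAT\n' | strip_csq
```

Each of the three sample lines should come out as DOG; CAT, and lines without a CSQ field pass through unchanged.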

So optionally match it.
sed 's/CSQ.*|\( [0-9.]*\)\?//g'
You can learn regex online with fun with regex crosswords.

awk makes this pretty easy.
$: awk '/CSQ.*\t/{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
Note that the file has to have actual tabs, not \t sequences. awk interprets the \t in the pattern as a tab character.
If there are no other formatted lines in the file that you want, then maybe just
$: awk '{print $1" "$NF}' file
DOG; CAT
DOG; CAT
DOG; CAT
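If the words around the CSQ block may themselves contain spaces, splitting on whitespace is fragile. A hedged alternative sketch, assuming each line has exactly one real tab: split on the tab only and trim the CSQ part off the first field. The function name drop_csq_field is made up for this example.

```shell
# Hypothetical variant: use the tab as the only field separator,
# delete "CSQ..." from the first field, then join the two fields.
drop_csq_field() {
  awk -F'\t' '{ sub(/CSQ.*/, "", $1); print $1 $2 }'
}

printf 'DOG; CSQ| 0.1234 | abcd | 0.9187 \tCAT\n' | drop_csq_field
```

The space after "DOG;" survives because it sits before CSQ in the first field.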

Related

How to delete a line of the text file from the output of checklist

I have a text file:
$100 Birthday
$500 Laptop
$50 Phone
I created a --checklist from the text file
[ ] $100 Birthday
[*] $500 Laptop
[*] $50 Phone
the output is $500 $50
How can I delete the lines for $500 and $50 from the text file, please?
The expected output of text file:
$100 Birthday
Thank you!
with grep and cut
grep -xf <(grep '\[ ]' file2.txt | cut -d ' ' -f3-) file1.txt
with grep and sed
grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
explanation
use grep to select lines from text file
$ grep Birthday file1.txt
100 Birthday
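cut splits a line into columns. -f 2 prints only the 2nd column, while -f 2- prints everything from the 2nd column onward. The delimiter given with -d is a single space here; it has to be quoted (' ') or escaped with a backslash so that the shell passes it on to cut.
A pipe | can supply the input instead of a file:
$ echo one two three | cut -d ' ' -f 2-
two three
$ grep Birthday file1.txt | cut -d ' ' -f 2-
Birthday
(if you escape the space with a backslash instead of quoting it, note that two spaces are needed: the escaped one that serves as the delimiter and a plain one that separates it from -f)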
assuming we have a text file temp.txt
$ cat temp.txt
Birthday
Laptop
Phone
grep can also read its list of search patterns from another file with -f
$ grep -f temp.txt file1.txt
100 Birthday
500 Laptop
50 Phone
or we print the file content with cat and hand it to grep -f through process substitution <(...)
$ grep -f <(cat temp.txt) file1.txt
100 Birthday
500 Laptop
50 Phone
Now let's generate temp.txt from the checklist. You only want to grep the lines containing [ ] and cut starting from the 3rd column (again, some characters have a special meaning and must therefore be escaped, as in \[)
$ grep '\[ ]' file2.txt
[ ] 100 Birthday
$ grep '\[ ]' file2.txt | cut -d ' ' -f3-
100 Birthday
You don't need temp.txt and can feed the list straight to grep -f with what is called process substitution, <(...)
$ grep -f <(grep '\[ ]' file2.txt | cut -d ' ' -f3-) file1.txt
100 Birthday
grep reads every line from temp.txt as a PATTERN, and some characters have a special meaning in a regex: ^ stands for the beginning of a line and $ for the end of a line. To be nitpicky, the correct search pattern should therefore be '^100 Birthday$' so that it won't match "1100 Birthday 2".
You might have noticed that I dropped the $ currency sign from your input files for a reason. You can keep it, but then tell grep to take the patterns literally with -F, and add the -x flag, which matches only whole lines such as "100 Birthday" (no regex for the line start/end needed)
sed [OPTION] 's/regexp/replacement/command' [file]
sed is more common when it comes to text editing. Instead of grep | cut we can do it with one single command:
grep '\[ ]' | cut -d ' ' -f3- and sed 's/\[ ] *//'
basically target the same lines and delete the [ ] from them.
There are however some extra flags required, because sed is a stream editor and prints every line of the file by default. To emulate grep's behavior we use
-n option to suppress the automatic printing of every line
p flag to print only the lines where a substitution was made
and for regexp
\[ ] (text to replace)
' *' = ' ' (whitespace) + * (star)
meaning: repeat the previous character 0 or more times, in particular all the spaces that follow the checkbox
(replacement is empty because we just want to delete)
so a working sed command that behaves similarly will look like this
sed -n 's/\[ ] *//p' file2.txt
And that's in my opinion all it takes for a checklist. However, you have two redundant files and want to match your cloned checklist against the original file, so let me show you some more complicated things.
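Instead of deleting the checkbox, let's output captured groups. These small examples explain it better than I can. \1 stands for the first capture group, \2 for the second, and so on (kind of like internal variables). In sed's default (basic) regex syntax a group is written \( \)
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\1/'
aaa
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\2/'
ccc
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/\1 \2/'
aaa ccc
$ echo aaabcccdd | sed 's/\(aaa\)b\(ccc\)dd/lets \1 replace \2 this/'
lets aaa replace ccc this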
so in this example sed 's/\[ ] \(.*\)/\1/' we use for regexp
\[ ] (text to replace)
' ' (trailing whitespace)
and inside the first capture group \( \) the desired "100 Birthday"
.* = . (dot) + * (star)
meaning: repeat the previous character 0 or more times (in particular a dot here)
but the dot . itself is regex for ANY char now (special meaning)
so the capture group is all the rest of the line
and for replacement we use (only)
\1 first capture group
$ sed -n 's/\[ ] \(.*\)/\1/p' file2.txt
100 Birthday
But there is more :)
Instead of matching only a ' ' space, there is another regex token with a special meaning
\s will match a space or a tab (a GNU extension)
+ repeats the previous character 1 or more times (note the difference to *, 0 or more times); it needs extended regex
\s+ will match a run of whitespace
and to make it work we need one more flag
-r use extended regular expressions (groups are then written as plain ( ) instead of \( \))
so with this command you can extract all search patterns from your cloned checklist...
$ sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt
100 Birthday
...and finally let it run against your original file (without the need of temp.txt)
$ grep -xf <(sed -rn 's/\[ ]\s+(.*)/\1/p' file2.txt) file1.txt
100 Birthday
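The whole pipeline can be sketched as one small function that also keeps the $ currency sign by using the -F and -x flags mentioned above. This is only an illustration under assumptions: the function name unchecked_items is invented, the checklist copy marks open entries with a leading "[ ] ", and a temporary file stands in for process substitution so the sketch also runs in a plain POSIX shell.

```shell
# Hypothetical helper: print the lines of the original list ($2) whose
# entries are still unchecked "[ ]" in the checklist copy ($1).
unchecked_items() {
  patterns=$(mktemp) || return 1
  # Strip the "[ ] " prefix from unchecked lines; checked "[*]" lines are skipped.
  sed -n 's/^\[ ] *//p' "$1" > "$patterns"
  # -F: patterns are fixed strings (so "$" is literal), -x: whole-line matches only.
  grep -Fxf "$patterns" "$2"
  status=$?
  rm -f "$patterns"
  return "$status"
}

# Example with throwaway files (the file names are illustrative).
dir=$(mktemp -d)
printf '%s\n' '$100 Birthday' '$500 Laptop' '$50 Phone' > "$dir/file1.txt"
printf '%s\n' '[ ] $100 Birthday' '[*] $500 Laptop' '[*] $50 Phone' > "$dir/file2.txt"
unchecked_items "$dir/file2.txt" "$dir/file1.txt"
rm -rf "$dir"
```

Redirecting the function's output to a new file and moving it over the original would then give the trimmed text file.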

How to insert a generated value by a loop while you open a file in bash

Lets say that I have:
cat FILENAME1.txt
Definition john
cat FILENAME2.txt
Definition mary
cat FILENAME3.txt
Definition gary
cat textfile.edited
text
text
text
I want to obtain an output like:
1 john text
2 mary text
3 gary text
I tried to use "stored" values from FILENAMES "generated" by a loop. I wrote this:
for file in $(ls *.txt); do
name=$(cat $file| grep -i Definition|awk '{$1="";print $0}')
#echo $name --> this command works as it gives the names
done
cat textfile.edited| awk '{printf "%s\t%s\n",NR,$0}'
which is very close to what I want to get
1 text
2 text
3 text
My issue was coming through when I tried to add the "stored" value. I tried the following with no success.
cat textfile.edited| awk '{printf "%s\t%s\n",$name,NR,$0}'
cat textfile.edited| awk '{printf "%s\t%s\n",name,NR,$0}'
cat textfile.edited| awk -v name=$name '{printf "%s\t%s\n",NR,$0}'
Sorry if the terminology used is not the best, but I started scripting recently.
Thank you in advance!!!
One solution using paste and awk ...
We'll append a count to the lines in textfile.edited (so we can see which lines are matched by paste):
$ cat textfile.edited
text1
text2
text3
First we'll look at the paste component:
$ paste <(grep -hi Definition FILENAME*.txt) textfile.edited
Definition john text1
Definition mary text2
Definition gary text3
From here awk can do the final slicing-n-dicing-n-numbering:
$ paste <(grep -hi Definition FILENAME*.txt) textfile.edited | awk 'BEGIN {OFS="\t"} {print NR,$2,$3}'
1 john text1
2 mary text2
3 gary text3
NOTE: It's not clear (to me) if the requirement is for a space or tab between the 2nd and 3rd columns; above solution assumes a tab, while using a space would be doable via a (awk) printf call.
You can do all with one awk command.
First file is the textfile.edited, other files are mentioned last.
awk 'NR==FNR {text[NR]=$0;next}
/^Definition/ {namenr++; names[namenr]=$2}
END { for (i=1;i<=namenr;i++) printf("%s %s %s\n", i, names[i], text[i]);}
' textfile.edited FILENAME*.txt
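A hedged, self-contained run of this two-pass idiom, wrapped in a function so it can be tried on throwaway files. The name number_names and the temporary directory are assumptions for the demonstration; the awk logic is the same as above. One caveat of NR==FNR: it only identifies the first file correctly when that file is not empty.

```shell
# Hypothetical wrapper: $1 is the text file, the remaining arguments are
# the files that hold the "Definition <name>" lines.
number_names() {
  awk 'NR == FNR { text[NR] = $0; next }    # first file: remember each line
       /^Definition/ { names[++n] = $2 }    # later files: collect the names
       END { for (i = 1; i <= n; i++) printf "%s %s %s\n", i, names[i], text[i] }' "$@"
}

# Recreate the sample files from the question in a temporary directory.
dir=$(mktemp -d)
printf 'Definition john\n' > "$dir/FILENAME1.txt"
printf 'Definition mary\n' > "$dir/FILENAME2.txt"
printf 'Definition gary\n' > "$dir/FILENAME3.txt"
printf 'text\ntext\ntext\n' > "$dir/textfile.edited"
number_names "$dir/textfile.edited" "$dir"/FILENAME*.txt
rm -rf "$dir"
```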
You can avoid awk with
paste -d' ' <(seq $(wc -l <textfile.edited)) \
<(sed -n 's/^Definition //p' FILE*) \
textfile.edited
Another version of the paste solution with a slightly careless grep -
$: paste -d' ' <( grep -ho '[^ ]*$' FILENAME?.txt ) textfile.edited
john text
mary text
gary text
Or, one more way to look at it...
$: a=( $(sed '/^Definition /s/.* //;' FILENAME[123].txt) )
$: echo "${a[@]}"
john mary gary
$: b=( $(<textfile.edited) )
$: echo "${b[@]}"
text text text
$: c=-1 # initialize so that the first pre-increment returns 0
$: while [[ -n "${a[++c]}" ]]; do echo "${a[c]} ${b[c]}"; done
john text
mary text
gary text
This will put all the values in memory before printing anything, so if the lists are really large it might not be your best bet. If they are fairly small, it's pretty efficient, and a single parallel index will keep them in order.
If the lines are not the same as the number of files, what did you want to do? As long as there aren't more files than lines, and any extra lines are ok to ignore, this still works. If there are more files than lines, then we need to know how you'd prefer to handle that.
A one-liner using GNU utilities:
paste -d ' ' <(cat -n FILENAME*.txt | sed 's/\sDefinition//') textfile.edited
Or,
paste -d ' ' <(cat -n FILENAME*.txt | sed 's/^\s*//;s/\sDefinition//') textfile.edited
if the leading white spaces are not desired.
Alternatively:
paste -d ' ' <(sed 's/^Definition\s//' FILENAME*.txt | cat -n) textfile.edited

sed | awk : Keep end of String until special character is reached

I'm trying to cut HDD IDs down with sed so that they contain just the serial number of the drive. The IDs look like:
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
So, I only want to keep the "WD2DWMC4N2575116". Serial numbers are not fixed length, so I tried to keep the characters from the end of the string back to the first "_" that appears. Unfortunately I suck at RegExp :(
To capture all characters after last _, using backreference:
$ sed 's/.*_\(.*\)/\1/' <<< "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116"
WD2DWMC4N2575116
Or as pointed out in comment, you can just remove all characters from beginning of the line up to last _:
sed 's/.*_//' file
echo "t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116" | rev | awk -F '_' '{print $1}' | rev
It works only if the ID is at the end.
Another in awk, this time using sub:
Data:
$ cat file
t10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116
Code + result:
$ awk 'sub(/^.*_/,"")' file
WD2DWMC4N2575116
i.e. replace everything from the first character to the last _ with nothing. As sub returns the number of substitutions made, that value is used to trigger the implicit output. If you have several records to process and not all of them have _s, add ||1 after the sub:
$ echo foo >> file
$ awk 'sub(/^.*_/,"") || 1' file
WD2DWMC4N2575116
foo
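When the ID is already in a shell variable, no external tool is needed at all. The following sketch swaps in POSIX parameter expansion instead of sed or awk; the function name serial_of is made up for the example.

```shell
# Hypothetical helper: keep only what follows the last underscore.
serial_of() {
  # "##*_" removes the longest prefix that ends in "_";
  # a value without any "_" is returned unchanged.
  printf '%s\n' "${1##*_}"
}

serial_of 't10.ATA_____WDC_WD30EFRX2D68EUZN0_________________________WD2DWMC4N2575116'
```

For a single value this avoids starting a process; for a whole file of IDs the sed and awk answers remain the more natural fit.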

BASH: grep characters and replace by the same plus tab

Basically, the only thing I need is to replace two spaces by a tab; this is the query:
abc def ghi K00001 jkl
all the columns are separated by a tab; the K00001 jkl is separated by two spaces. But I want these two spaces to be replaced by a tab.
I cannot just grep every pair of spaces, since other contents have two spaces and those should stay.
My approach would be to grep:
grep '[0-9][0-9][0-9][0-9][0-9] ' file
but I want to replace it to have the same K00001<TAB>jkl
How do I replace by the same string? Can I use variables to store the grep result and then print the modified (tab not spaces) by the same string?
sed -r "s/([A-Z][0-9]{5})  /\1\t/" File
or
sed -r "s/([A-Z][0-9]{5})\s{2}/\1\t/" File
Example :
AMD$ echo "abc def ghi K00001  jkl" | sed -r "s/([A-Z][0-9]{5})  /\1\t/"
abc def ghi K00001 jkl
You can use this sed:
sed -E $'s/([^[:blank:]]) {2}([^[:blank:]])/\\1\t\\2/g' file
Regex ([^[:blank:]]) {2}([^[:blank:]]) makes sure to match 2 spaces surrounded by 2 non-space characters. In replacement we put back surrounding characters using back-references \1 and \2
I would use awk, since with awk it doesn't matter whether the fields are separated by one, two or more spaces: I can force the output to be tab-separated:
$ echo "abc def ghi K00001 jkl" |awk -v OFS="\t" '{$1=$1}1'
abc def ghi K00001 jkl
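A portable sketch of the targeted replacement, assuming the ID always has the shape of one uppercase letter followed by five digits and is followed by exactly two spaces. It uses basic regex syntax and a literal tab from a shell variable, so it does not depend on the GNU-only -r, \s or \t. The function name fix_k_column is hypothetical.

```shell
# Hypothetical helper: turn the two spaces after an ID like K00001 into a tab,
# leaving every other pair of spaces on the line alone.
tab=$(printf '\t')          # a literal tab character

fix_k_column() {
  sed "s/\([A-Z][0-9]\{5\}\)  /\1${tab}/"
}

# "def  ghi" keeps its two spaces; only the gap after K00001 becomes a tab.
printf 'abc\tdef  ghi\tK00001  jkl\n' | fix_k_column
```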

Search first occurrence and print until next delimiter, but match whole word only

I have a file with multiple lines of text similar to:
foo
1
2
3
bar
fool
1
2
3
bar
food
1
2
3
bar
So far the following gives me a closer answer:
sed -n '/foo/,/bar/ p' file.txt | sed -e '$d'
...but it fails by introducing duplicates if it encounters words like "food" or "fool". I want to make the code above do a whole word match only (i.e. grep -w), but inserting the \b switch doesn't seem to work:
sed -n '/foo/\b,/bar/ p' file.txt | sed -e '$d'
I would like to print anything after "foo" (including the first foo) up until "bar", but matching only "foo", and not "foo1".
Use the Regex tokens ^ and $ to indicate the start and end of a line respectively:
sed -n '/^foo$/,/^bar$/ p' file.txt
sed -n '/\<foo\>/,/\<bar\>/ p' file.txt
Or maybe this, if foo and bar have to be the first word of a line.
sed -n '/^\<foo\>/,/^\<bar\>/ p' file
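The \< and \> word anchors are GNU extensions, and a sed range prints every matching block, not just the first. A hedged awk sketch that compares whole lines, prints only the first block and leaves out the closing delimiter (which the question removed with sed '$d'). The function name first_block and its argument order are assumptions for the example.

```shell
# Hypothetical helper: print from the first line that is exactly $2
# up to, but not including, the next line that is exactly $3, in file $1.
first_block() {
  awk -v start="$2" -v stop="$3" '
    $0 == start { found = 1 }       # whole-line comparison, so "fool" or "food" never match
    found && $0 == stop { exit }    # stop before printing the delimiter
    found                           # print every line of the block
  ' "$1"
}

# Example with a throwaway file that mirrors the input from the question.
sample=$(mktemp)
printf '%s\n' foo 1 2 3 bar fool 1 2 3 bar food 1 2 3 bar > "$sample"
first_block "$sample" foo bar
rm -f "$sample"
```

Because awk exits at the first closing delimiter, large files are not read to the end.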

Resources