How to delete all text on a line appearing after a particular symbol? - ruby

I have a file, file1.txt, like this:
This is some text.
This is some more text. ② This is a note.
This is yet some more text.
I need to delete any text appearing after "②", including the "②" and any single space appearing immediately before, if such a space is present. E.g., the above file would become file2.txt:
This is some text.
This is some more text.
This is yet some more text.
How can I delete the "②", anything coming after, and any preceding single space?
The solutions at "How can I remove all text after a character in bash?" do not seem to work, perhaps because "②" is not an ordinary character.
The file is saved in UTF-8.

A Perl solution:
$ perl -CS -i~ -p -E's/ ?②.*//' file1.txt
You'll end up with the correct data in file1.txt and a backup of the original file in file1.txt~.

Be aware that many Unix utilities do not handle Unicode well. I assume your input is in UTF-8; if not, you have to adjust accordingly.
#!/bin/bash
function px {
local a="$*"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
echo -e $utf16header
out=''
while read line
do
if [ "$line" == "000a" ]
then
out="$out $line"
echo -e $out
out=''
else
out="$out $line"
fi
done
if [ "$out" != '' ] ; then
echo -e $out
fi
fi |
(perl -pe 's/( 0020)* 2461 .*$/ 000a/;s/ *//g') |
while read line
do
px $line
done | (iconv -f UTF16 -t UTF8 )

sed -e "s/[[:space:]]②[^\.]*\.//"
However, I am not sure that the ② symbol is parsed correctly. Maybe you have to use UTF-8 codes or something like that.

Try this:
sed -e '/②/ s/[ ]*②.*$//'
/②/ look only for the lines containing the magic symbol;
[ ]* for any number (matches none) of spaces before the magic symbol;
.*$ everything else till the end of line.
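The question asks for at most one preceding space to be removed, whereas [ ]* removes any number of them. Here is a minimal sketch of that stricter variant, assuming a sed that accepts the POSIX \{0,1\} interval and a script saved in UTF-8. The function name strip_note is made up for illustration, and LC_ALL=C is an assumption that makes sed compare raw bytes, so the multi-byte ② needs no locale support:

```shell
# strip_note: hypothetical helper, not from the answers above.
# Deletes "②" and everything after it, plus at most one space
# directly before it. Reads the named files, or stdin if none are given.
strip_note() {
  LC_ALL=C sed 's/ \{0,1\}②.*$//' "$@"
}

printf '%s\n' 'This is some text.' \
  'This is some more text. ② This is a note.' \
  'This is yet some more text.' | strip_note
```

To produce file2.txt from the question, something like strip_note file1.txt > file2.txt should work.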

Related

combining all files that contains the same word into a new text file with leaving new lines between individual files

This is my first question here. I have a folder called "materials" that contains 40 text files. I am trying to combine the text files that contain the word "carbon" (in either capitalized or lowercase form) into a single file, leaving a newline between them. I used "grep -w carbon *" to identify the files that contain the word carbon, but I don't know what to do after that. I really appreciate all your help!
grep -il carbon materials/*txt | while IFS= read -r line; do
echo ">> Adding $line";
cat "$line" >> result.out;
echo >> result.out;
done
Explanation
grep searches the files for the string. -i ignores case in the searched string. -l prints only the names of the files containing the string.
The while loop iterates over the files containing the string.
cat with >> appends each file to result.out.
echo >> adds a newline to result.out after each file's content.
Execution
$ ls -1 materials/*.txt
materials/1.txt
materials/2.txt
materials/3.txt
$ grep -i carbon materials/*.txt
materials/1.txt:carbon
materials/2.txt:CARBON
$ grep -il carbon materials/*txt | while IFS= read -r line; do echo ">> Adding $line"; cat "$line" >> result.out; echo >> result.out; done
>> Adding materials/1.txt
>> Adding materials/2.txt
$ cat result.out
carbon
CARBON
Try this (assuming your filenames don't contain newline characters):
grep -iwl carbon ./* |
while IFS= read -r f; do cat "$f"; echo; done > /tmp/combined
If it is possible that your filenames may contain newline characters and your shell is bash, then:
grep -iwlZ carbon ./* |
while IFS= read -r -d '' f; do cat "$f"; echo; done > /tmp/combined
grep is assumed to be GNU grep (for the -w and -Z options). Note that these will leave a trailing newline character in the file /tmp/combined.
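An alternative that avoids parsing grep's filename output at all is to test each file in a loop. This is only a sketch: the function name combine_matching is hypothetical, and it assumes a grep that supports -w (GNU and BSD grep do, though it is not in POSIX):

```shell
# combine_matching WORD FILE...
# Hypothetical helper: prints every FILE that contains WORD as a whole
# word (case-insensitive), each followed by one empty line.
combine_matching() {
  word=$1
  shift
  for f in "$@"; do
    # -q: no output, only the exit status tells us whether the file matches
    if grep -qiw -- "$word" "$f"; then
      cat -- "$f"
      echo
    fi
  done
}

# Example call (paths taken from the question):
# combine_matching carbon materials/*.txt > result.out
```

Because the filenames never pass through a pipe, names containing spaces or newlines need no special handling. The cost is one grep process per file, which should not matter for 40 files.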

Using cut on stdout with tabs

I have a file which contains lines of tab-separated text
echo -e "foo\tbar\tfoo2\nx\ty\tz" > file.txt
I'd like to get the first column with cut. It works if I do
$ cut -f 1 file.txt
foo
x
But if I read it in a bash script
while read line
do
new_name=`echo -e $line | cut -f 1`
echo -e "$new_name"
done < file.txt
Then I get instead
foo bar foo2
x y z
What am I doing wrong?
/edit: My script looks like that right now
while IFS=$'\t' read word definition
do
clean_word=`echo -e $word | external-command`
echo -e "$clean_word\t<b>$word</b><br>$definition" >> $2
done < $1
External command removes diacritics from a Greek word. Can the script be optimized any further without changing external-command?
What is happening is that you did not quote $line when passing it to echo. The shell therefore split the line into words on whitespace, and echo joined those words with single spaces, so the original tab-delimited format was lost. Since cut's default delimiter is a TAB, it does not find any and prints the whole line.
So quoting works:
while read line
do
new_name=`echo -e "$line" | cut -f 1`
#----------------^^^^^^^
echo -e "$new_name"
done < file.txt
Note, however, that you could have used IFS to set the tab as field separator and read more than one parameter at a time:
while IFS=$'\t' read name rest;
do
echo "$name"
done < file.txt
returning:
foo
x
And, again, note that awk is even faster for this purpose:
$ awk -F"\t" '{print $1}' file.txt
foo
x
So, unless you want to call some external command while looping the file, awk (or sed) is better.
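The effect of the missing quotes can be reproduced without a file. This is a small sketch, assuming bash or another POSIX-style shell with the default IFS (zsh does not split unquoted variables by default); the variable names unquoted and quoted are made up for the demonstration:

```shell
# One tab-separated line, as in file.txt from the question.
line=$(printf 'foo\tbar\tfoo2')

# Unquoted: the shell splits $line on whitespace and echo rejoins the
# words with single spaces, so cut finds no tab and prints everything.
unquoted=$(echo $line | cut -f 1)

# Quoted: the tabs survive, so cut can pick the first field.
quoted=$(printf '%s\n' "$line" | cut -f 1)

printf 'unquoted: %s\nquoted:   %s\n' "$unquoted" "$quoted"
```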

Speed up bash filter function to run commands consecutively instead of per line

I have written the following filter as a function in my ~/.bash_profile:
hilite() {
export REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g")
while read line
do
echo $line | egrep "$1" | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
done
exit 0
}
to find lines of anything piped into it matching a regular expression, and highlight matches using ANSI escape codes on a VT100-compatible terminal.
For example, the following finds and highlights the strings bin, U or 1 which are whole words in the last 10 lines of /etc/passwd:
tail /etc/passwd | hilite "\b(bin|[U1])\b"
However, the script runs very slowly as each line forks an echo, egrep and sed.
In this case, it would be more efficient to do egrep on the entire input, and then run sed on its output.
How can I modify my function to do this? I would prefer to not create any temporary files if possible.
P.S. Is there another way to find and highlight lines in a similar way?
sed can do a bit of grepping itself: if you give it the -n flag (or #n instruction in a script) it won't echo any output unless asked. So
while read line
do
echo $line | egrep "$1" | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
done
could be simplified to
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
EDIT:
Here's the whole function:
hilite() {
REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g");
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
}
That's all there is to it - no while loop, reading, grepping, etc.
If your egrep supports --color, just put this in .bash_profile:
hilite() { command egrep --color=auto "$@"; }
(Personally, I would name the function egrep; hence the usage of command).
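Two details of the sed version are GNU-specific: the \x1b escape, and the alternation in a basic regular expression that the REGEX_SED conversion relies on. Below is a sketch of a variant that sidesteps both, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed do). The escape character is produced with printf instead, and the pattern must not contain a slash, since it is pasted into the s/// command unchanged:

```shell
# hilite ERE
# Sketch only: prints the stdin lines that match ERE, with every match
# shown in reverse video. Assumes the pattern contains no "/".
hilite() {
  local esc
  esc=$(printf '\033')   # literal ESC character, no \x1b needed in sed
  sed -n -E "s/$1/${esc}[7m&${esc}[0m/gp"
}

# Example call:
# tail /etc/passwd | hilite 'bin|root'
```

Note that \b from the original example is itself a GNU extension, so a pattern that uses it still needs GNU sed.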
I think you can replace the whole while loop with simply
sed -n "s/$REGEX_SED/\x1b[7m&\x1b[0m/gp"
because sed reads stdin line by line, so you don't need read.
I'm not sure if running egrep and piping to sed is faster than using sed alone, but you can always compare using time.
Edit: added -n and p to sed to print only highlighted lines.
Well, you could simply do this:
egrep "$1" | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
But I'm not sure that it'll be that much faster ; )
Just for the record, this is a method using a temporary file:
hilite() {
export REGEX_SED=$(echo $1 | sed "s/[|()]/\\\&/g")
export FILE=$2
if [ -z "$FILE" ]
then
export FILE=~/tmp
echo -n > $FILE
while read line
do
echo $line >> $FILE
done
fi
egrep "$1" $FILE | sed "s/$REGEX_SED/\x1b[7m&\x1b[0m/g"
return $?
}
which also takes a file/pathname as the second argument, for cases like
cat /etc/passwd | hilite "\b(bin|[U1])\b"

Bash script get item from array

I'm trying to read a file line by line in bash.
Every line has the format text|number.
I want to produce a file with the format text,text,text and so on, so the new file would contain just the text from the original file, separated by commas.
Here is what I've tried and couldn't get to work:
FILENAME=$1
OLD_IFS=$IFS
IFS=$'\n'
i=0
for line in $(cat "$FILENAME"); do
array=(`echo $line | sed -e 's/|/,/g'`)
echo ${array[0]}
i=i+1;
done
IFS=$OLD_IFS
But this prints both the text and the number, just in a different format: text number
Here is sample input:
dsadadq-2321dsad-dasdas|4212
dsadadq-2321dsad-d22as|4322
here is sample output:
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
What did I do wrong?
Not pure bash, but you could do this in awk:
awk -F'|' 'NR>1{printf(",")} {printf("%s",$1)}'
Alternately, in pure bash and without having to strip the final comma:
#!/bin/bash
# You can get your input from somewhere else if you like. Even stdin to the script.
input=$'dsadadq-2321dsad-dasdas|4212\ndsadadq-2321dsad-d22as|4322\n'
# Output should be reset to empty, for safety.
output=""
# Step through our input. (I don't know your column names.)
while IFS='|' read left right; do
# Only add a field if it exists. Salt to taste.
if [[ -n "$left" ]]; then
# Append data to output string
output="${output:+$output,}$left"
fi
done <<< "$input"
echo "$output"
No need for arrays and sed:
while IFS='' read line ; do
echo -n "${line%|*}",
done < "$FILENAME"
You just have to remove the last comma :-)
Using sed:
$ sed ':a;N;$!ba;s/|[0-9]*\n*/,/g;s/,$//' file
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
Alternatively, here is a bit more readable sed with tr:
$ sed 's/|.*$/,/g' file | tr -d '\n' | sed 's/,$//'
dsadadq-2321dsad-dasdas,dsadadq-2321dsad-d22as
Choroba has the best answer (imho) except that it does not handle blank lines and it adds a trailing comma. Also, mucking with IFS is unnecessary.
This is a modification of his answer that solves those problems:
while read line ; do
if [ -n "$line" ]; then
if [ -n "$afterfirst" ]; then echo -n ,; fi
afterfirst=1
echo -n "${line%|*}"
fi
done < "$FILENAME"
The first if just filters out blank lines. The second if and the $afterfirst variable prevent the extra comma: the loop echoes a comma before every entry except the first one. ${line%|*} is a bash parameter expansion that deletes the end of a parameter's value if it matches a pattern. line is the parameter, % is the symbol indicating that a trailing pattern should be deleted, and |* is the pattern to delete.
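The pieces from the answers above (the ${line%|*} expansion, the ${output:+$output,} idiom, and the blank-line check) can be combined into one small function. This is a sketch rather than anyone's original answer, and the name join_first_fields is hypothetical:

```shell
# join_first_fields: hypothetical helper.
# Reads "text|number" lines from stdin and prints the text parts joined
# by commas on one line. Blank lines are skipped.
join_first_fields() {
  local out='' line
  # "|| [ -n "$line" ]" also handles a last line without a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    # ${out:+$out,} expands to "$out," only when out is non-empty,
    # so no leading or trailing comma is produced.
    out="${out:+$out,}${line%|*}"
  done
  printf '%s\n' "$out"
}

printf '%s\n' 'dsadadq-2321dsad-dasdas|4212' '' 'dsadadq-2321dsad-d22as|4322' |
  join_first_fields
```

% removes the shortest trailing match, so a line with several | characters loses only its last field. Using %% instead would remove the longest match and keep only the first field.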

How to delete all lines containing more than three characters in the second column of a CSV file?

How can I delete all of the lines in a CSV file which contain more than 3 characters in the second column? E.g.:
cave,ape,1
tree,monkey,2
The second line contains more than 3 characters in the second column, so it will be deleted.
awk -F, 'length($2)<=3' input.txt
You can use this command:
grep -vE "^[^,]+,[^,]{4,}," test.csv > filtered.csv
Breakdown of the grep syntax:
-v = remove lines matching
-E = extended regular expression syntax (also -P is perl syntax)
Shell redirection:
> filename = overwrite/create a file and fill it with the standard out
Breakdown of the regex syntax:
"^[^,]+,[^,]{4,},"
^ = beginning of line
[^,] = anything except commas
[^,]+ = 1 or more of anything except commas
, = comma
[^,]{4,} = 4 or more of anything except commas
And please note that the above is simplified and would not work if the first 2 columns contained commas in the data. (it does not know the difference between escaped commas and raw ones)
No one has supplied a sed answer yet, so here it is:
sed -e '/^[^,]*,[^,]\{4\}/d' animal.csv
And here's some test data.
>animal.csv cat <<'.'
cave,ape,0
,cat,1
,orangutan,2
large,wolf,3
,dog,4,happy
tree,monkey,5,sad
.
And now to test:
sed -i'' -e '/^[^,]*,[^,]\{4\}/d' animal.csv
cat animal.csv
Only ape, cat and dog should appear in the output.
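The same test data can be fed to the awk one-liner from the first answer as a cross-check. This is a sketch; the function name keep_short_second_column is made up, and it assumes plain ASCII data, since some awk implementations count bytes rather than characters in length():

```shell
# keep_short_second_column: hypothetical wrapper around the awk answer.
# Prints only the lines whose second comma-separated field has at most
# three characters. Reads the named files, or stdin if none are given.
keep_short_second_column() {
  awk -F, 'length($2) <= 3' "$@"
}

printf '%s\n' 'cave,ape,0' ',cat,1' ',orangutan,2' \
  'large,wolf,3' ',dog,4,happy' 'tree,monkey,5,sad' |
  keep_short_second_column
```

Like the sed version, this keeps lines that have no second column at all, because an empty field has length zero.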
This is a filter script for your type of data. It assumes your data is in utf8
#!/bin/bash
function px {
local a="$*"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
px $utf16header
cnt=0
out=''
st=0
while read line
do
if [ "$st" -eq 1 ] ; then
cnt=$(($cnt+1))
fi
if [ "$line" == "002c" ] ; then
st=$(($st+1))
fi
if [ "$line" == "000a" ]
then
out=$out$line
if [[ $cnt -le 3+1 ]] ; then
px $out
fi
cnt=0
out=''
st=0
else
out=$out$line
fi
done
fi | iconv -f UTF16 -t UTF8
