How to delete all lines containing more than three characters in the second column of a CSV file? - bash

How can I delete all of the lines in a CSV file which contain more than 3 characters in the second column? E.g.:
cave,ape,1
tree,monkey,2
The second line contains more than 3 characters in the second column, so it will be deleted.

awk -F, 'length($2)<=3' input.txt

You can use this command:
grep -vE "^[^,]+,[^,]{4,}," test.csv > filtered.csv
Breakdown of the grep syntax:
-v = invert the match (print only the lines that do not match)
-E = extended regular expression syntax (-P selects Perl-compatible syntax instead)
Shell redirection:
> filename = create or overwrite the file and fill it with standard output
Breakdown of the regex syntax:
"^[^,]+,[^,]{4,},"
^ = beginning of line
[^,] = anything except commas
[^,]+ = 1 or more of anything except commas
, = comma
[^,]{4,} = 4 or more of anything except commas
Please note that the regex above is simplified: it will not work if either of the first two columns contains commas in the data, because it cannot tell escaped or quoted commas from field separators. It also requires a non-empty first column and a third column after the second.
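Two of those limits can be relaxed with a small change. As a sketch (the function name drop_long_col2 is made up for illustration): [^,]* instead of [^,]+ allows an empty first column, and dropping the trailing comma allows the second column to be the last one. It still assumes that no field contains a quoted or escaped comma.

```shell
# Hypothetical helper: read CSV on stdin and drop every line whose
# second field has four or more characters.
# Assumption: plain comma-separated data, no quoted or escaped commas.
drop_long_col2() {
    grep -vE '^[^,]*,[^,]{4,}'
}

# Example: only lines with a second field of three characters or fewer survive.
printf '%s\n' 'cave,ape,1' 'tree,monkey,2' ',orangutan,3' 'big,wolf' | drop_long_col2
```

For real-world CSV with quoting, a proper CSV parser is probably a safer choice than a regex.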

No one has supplied a sed answer yet, so here it is:
sed -e '/^[^,]*,[^,]\{4\}/d' animal.csv
And here's some test data.
>animal.csv cat <<'.'
cave,ape,0
,cat,1
,orangutan,2
large,wolf,3
,dog,4,happy
tree,monkey,5,sad
.
And now to test:
sed -i'' -e '/^[^,]*,[^,]\{4\}/d' animal.csv
cat animal.csv
Only ape, cat and dog should appear in the output.

This is a filter script for your type of data. It assumes your data is in UTF-8:
#!/bin/bash
function px {
local a="$@"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
px $utf16header
cnt=0
out=''
st=0
while read line
do
if [ "$st" -eq 1 ] ; then
cnt=$(($cnt+1))
fi
if [ "$line" == "002c" ] ; then
st=$(($st+1))
fi
if [ "$line" == "000a" ]
then
out=$out$line
if [[ $cnt -le 3+1 ]] ; then
px $out
fi
cnt=0
out=''
st=0
else
out=$out$line
fi
done
fi | iconv -f UTF16 -t UTF8

Related

Accept filename as argument and calculate repeated words along with count

I need to find the number of repeated characters in a text file, and I need to pass the filename as an argument.
Example:
test.txt data contains
Zoom
Output should be like:
z 1
o 2
m 1
I need a command that will accept filename as argument and then lists the number of characters from that file. In my example I have a test.txt which has zoom word. So the output will be like how many times each letter has repeated.
My attempt:
vi test.sh
#!/bin/bash
FILE="$1" # to pass the filename as an argument
sort file1.txt | uniq -c # to count the number of letters
Just a guess?
cat test.txt |
tr '[:upper:]' '[:lower:]' |
fold -w 1 |
sort |
uniq -c |
awk '{print $2, $1}'
m 1
o 2
z 1
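Since the question asks for a script that takes the filename as an argument, the pipeline above could be wrapped roughly like this. It is only a sketch: the function name count_chars is made up, and it assumes plain ASCII text, because fold may count bytes rather than characters.

```shell
#!/bin/bash
# Hypothetical function: count_chars FILE
# Prints "char count" for every non-blank character in FILE, ignoring case.
count_chars() {
    tr '[:upper:]' '[:lower:]' < "$1" |
        fold -w 1 |                   # one character per line
        grep -v '^[[:space:]]*$' |    # skip empty lines and whitespace characters
        sort |
        uniq -c |
        awk '{print $2, $1}'          # swap columns to "char count"
}

# Run only when the script is given a filename, e.g.: ./test.sh test.txt
if [ "$#" -gt 0 ]; then
    count_chars "$1"
fi
```

Reading the file with a redirection instead of cat saves one process, and quoting "$1" keeps filenames with spaces working.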
Here is an awk script that counts every kind of character:
awk '
BEGIN{FS = ""} # make each char a field
{
for (i = 1; i <= NF; i++) { # iterate over all fields in the line
++charsArr[$i]; # count each occurrence of the field in the array
}
}
END {
for (char in charsArr) { # iterate over the chars array
printf("%3d %s\n", charsArr[char], char); # print the count of occurrences, then the char
}
}' |sort -n
Or in one line:
awk '{for(i=1;i<=NF;i++)++arr[$i]}END{for(char in arr)printf("%3d %s\n",arr[char],char)}' FS="" input.1.txt|sort -n
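One caveat: as far as I know, POSIX leaves the behavior of an empty FS unspecified, so splitting a line into characters that way depends on the awk implementation. A sketch that walks the line with substr() instead should work in any POSIX awk (the function name count_chars_awk is made up):

```shell
# Hypothetical function: count every character (spaces included)
# read from stdin or from the files given as arguments.
count_chars_awk() {
    awk '
    {
        n = length($0)
        for (i = 1; i <= n; i++)
            count[substr($0, i, 1)]++    # one array slot per distinct character
    }
    END {
        for (c in count)
            printf "%3d %s\n", count[c], c
    }' "$@" | sort -n
}

printf 'Zoom\n' | count_chars_awk
```

The order in which characters with the same count appear depends on sort and the current locale.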
#!/bin/bash
#get the argument for further processing
inputfile="$1"
#check if file exists
if [ -f "$inputfile" ]
then
#convert file to a usable format
#convert all characters to lowercase
#put each character on a new line
#output to temporary file
cat "$inputfile" | tr '[:upper:]' '[:lower:]' | sed -e 's/\(.\)/\1\n/g' > tmp.txt
#loop over every character from a-z
for char in {a..z}
do
#count how many times a character occurs
count=$(grep -c "$char" tmp.txt)
#print if count > 0
if [ "$count" -gt "0" ]
then
echo -e "$char" "$count"
fi
done
rm tmp.txt
else
echo "file not found!"
exit 1
fi

Hex to decimal conversion in bash without using gawk

Input:
cat test1.out
12 , maze|style=0x48570006, column area #=0x7, location=0x80000d
13 , maze|style=0x48570005, column area #=0x7, location=0x80aa0d
....
...
..
.
Output needed:
12 , maze|style=0x48570006, column area #=0x7, location=8388621 <<<8388621 is decimal of 0x80000d
....
I want to convert just the last column to decimal.
I cannot use gawk as it is not available in our company machines everywhere.
Tried using awk --non-decimal-data but it didnt work also.
Wondering if just printf command can work on flipping the last word from hex to decimal.
Any other ideas that you can suggest?
There's no need for awk or any other external command here: bash's native math operations handle hexadecimal values correctly in an arithmetic context (this is why echo $((0xff)) emits 255).
#!/usr/bin/env bash
# ^^^^- must be really bash, not /bin/sh
location_re='location=(0x[[:xdigit:]]+)([[:space:]]|$)'
while read -r line; do
if [[ $line =~ $location_re ]]; then
hex=${BASH_REMATCH[1]}
dec=$(( $hex ))
printf '%s\n' "${line/location=$hex/location=$dec}"
else
printf '%s\n' "$line"
fi
done
You can see this running at https://ideone.com/uN7qNY
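If the script also has to run under a plain POSIX sh, the same idea can be sketched without [[ ]] and BASH_REMATCH, using only parameter expansion and arithmetic expansion. This is a rough variant, not the code above: the function name location_to_dec is made up, and it assumes the hex value is the last thing on the line, as in the sample input.

```shell
# Hypothetical filter: rewrite a trailing "location=0x..." as decimal.
# Assumption: the hex number is the last thing on the line.
location_to_dec() {
    while IFS= read -r line || [ -n "$line" ]; do
        hex=
        case $line in
            *location=0x*) hex=${line##*location=} ;;   # text after the last "location="
        esac
        case $hex in
            0x*[!0-9a-fA-F]* | 0x | '')
                printf '%s\n' "$line"                    # nothing clean to convert
                ;;
            0x*)
                # strip the hex text from the end, append its decimal value
                printf '%s%s\n' "${line%"$hex"}" "$(( $hex ))"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}

printf '%s\n' '12 , maze|style=0x48570006, column area #=0x7, location=0x80000d' | location_to_dec
```

Lines without a clean trailing hex value are passed through unchanged, so unexpected input does not trigger an arithmetic error.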
In case the strtonum() function is not available, how about:
#!/bin/bash
awk -F'location=0x' '
function hex2dec(str,
i, x, c, tab) {
for (i = 0; i <= 15; i++) {
tab[substr("0123456789ABCDEF", i + 1, 1)] = i;
}
x = 0
for (i = 1; i <= length(str); i++) {
c = toupper(substr(str, i, 1))
x = x * 16 + tab[c]
}
return x
}
{
print $1 "location=" hex2dec($2)
}
' test1.out
where hex2dec() is a homemade substitute for strtonum().
Wait, can't you just use printf in other awks? It won't work with gawk but it does with other awks, right? For example with mawk:
$ mawk 'BEGIN{FS=OFS="="}{$NF=sprintf("%d", $NF);print}' file
12 , maze|style=0x48570006, column area #=0x7, location=8388621
13 , maze|style=0x48570005, column area #=0x7, location=8432141
I tested with mawk, awk-20070501, awk-20121220 and Busybox awk.
Discarded after edit but left for comments' sake:
Using rev and cut to extract around the last = and printf for hex2dec conversion:
$ while IFS='' read -r line || [[ -n "$line" ]]
do
printf "%s=%d\n" "$(echo "$line" | rev | cut -d = -f 2- | rev)" \
$(echo "$line" | rev | cut -d = -f 1 | rev)
done < file
Output:
12 , maze|style=0x48570006, column area #=0x7, location=8388621
13 , maze|style=0x48570005, column area #=0x7, location=8432141
If you have Perl installed, not having Gawk is rather inconsequential.
perl -pe 's/location=\K0x([0-9a-f]+)/ hex($1) /e' file
This might work for you (GNU sed and Bash):
sed 's/\(.*location=\)\(0x[0-9a-f]\+\)/echo "\1$((\2))"/Ie' file
Use pattern matching and back references to split each line and then evaluate an echo command.
Alternative:
sed 's/\(.*location=\)\(0x[0-9a-f]\+\)/echo "\1$((\2))"/I' file | sh
BASH_REMATCH array info:
http://molk.ch/tips/gnu/bash/rematch.html
The basic principle:
[[ string =~ regexp ]]
[[ "abcdef" =~ (b)(.)(d)e ]]
If the string matches the regexp, the matched part of the string is stored in the BASH_REMATCH array.
# Now:
# BASH_REMATCH[0]=bcde # the whole match
# BASH_REMATCH[1]=b # the 1st captured group
# BASH_REMATCH[2]=c # the 2nd captured group
# BASH_REMATCH[3]=d # the 3rd captured group
Enjoy!
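To check this on your own machine, a small sketch like the following prints every element of the array. It needs bash, not plain sh, and the function name show_rematch is made up. The regex is kept in a variable and expanded unquoted, since quoting the right-hand side of =~ makes bash treat it as a literal string.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show_rematch STRING REGEX
# Prints the whole match and each captured group, one per line.
show_rematch() {
    local string=$1 regex=$2 i
    if [[ $string =~ $regex ]]; then
        for i in "${!BASH_REMATCH[@]}"; do
            printf '%d: %s\n' "$i" "${BASH_REMATCH[i]}"
        done
    else
        return 1    # no match
    fi
}

show_rematch "abcdef" '(b)(.)(d)e'
```

Index 0 holds the whole match, and indexes 1 and up hold the groups in the order of their opening parentheses.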
Bash handles hexadecimal values natively, both in arithmetic expansion and in printf.
Example:
echo $(( 0xff))
255
printf '%d' 0xf0
240

Removing current line of a file

I'm facing something that looks easy, but I can't find the answer:
The goal of this function is to remove all the lines that contain 3 commas ',':
while read line; do
COUNT=$(echo $line | grep -o "\," | wc -l)
if [ $COUNT -ne 3 ]; then
remove line
fi
done < tmp.txt
I can't find how to remove the current line. Can you help me?
I extract this tmp.txt from a larger file with grep. If the data were in a variable instead of tmp.txt, would it work the same way?
while read line; do
COUNT=$(echo $line | grep -o "\," | wc -l)
if [ $COUNT -ne 3 ]; then
remove line
fi
done <<< "$toto"
Thanks in advance
A sed-only solution:
sed '/^\([^,]*,\)\{3\}[^,]*$/d' infile
This deletes every line in which the comma character occurs exactly 3 times.
Or using awk:
awk -F, 'NF!=4' infile
Or, with both commands reading from a variable:
sed '/^\([^,]*,\)\{3\}[^,]*$/d' <<<"$variable"
awk -F, 'NF!=4' <<<"$variable"
A simple awk solution
awk 'gsub(/,/,",")!=3' file
gsub replaces the pattern with the specified string and it returns the number of substitutions/replacements made.
We are replacing , with , here and thus gsub will return us the number of , in the string.
Example :
Input file
hello this line has 1 ,
This line, has, 3 ,
This line, has, 4 , commas , Thanks
Output
$ awk 'gsub(/,/,",")!=3' file
hello this line has 1 ,
This line, has, 4 , commas , Thanks
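The gsub trick generalizes to any count. As a sketch (the function name drop_n_commas is made up), the number of commas can be passed in with -v:

```shell
# Hypothetical helper: drop_n_commas N [FILE...]
# Prints every line that does NOT contain exactly N commas.
drop_n_commas() {
    n=$1
    shift
    # gsub returns the number of commas it "replaced" on the current line
    awk -v n="$n" 'gsub(/,/, ",") != n' "$@"
}

printf '%s\n' 'hello this line has 1 ,' 'This line, has, 3 ,' 'This line, has, 4 , commas , Thanks' |
    drop_n_commas 3
```

With no file arguments it reads standard input, so it also works on the contents of a variable through a pipe or a here-string.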
I would have done it the other way around:
while read -r line; do
COUNT=$(echo "$line" | grep -o "," | wc -l)
if [ "$COUNT" -eq 3 ]; then
echo "$line" >> "$tempofile"
fi
done < tmp.txt
If the line matches, keep it; otherwise move on to the next line.
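A loop cannot delete the current line from the file it is reading, which is why this approach writes the kept lines somewhere else. A slightly more defensive sketch of the same loop follows; the function name keep_three_commas is made up. It counts commas with tr and the length of the result, and uses IFS= read -r so that leading spaces and backslashes survive.

```shell
# Hypothetical filter: print only the lines that contain exactly three commas.
keep_three_commas() {
    while IFS= read -r line || [ -n "$line" ]; do
        commas=$(printf '%s' "$line" | tr -cd ',')   # delete everything except commas
        if [ "${#commas}" -eq 3 ]; then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' 'a,b,c,d' 'a,b' 'x,,,' | keep_three_commas
```

It reads standard input, so it works the same on a file (keep_three_commas < tmp.txt) and on a variable (printf '%s\n' "$toto" | keep_three_commas). To drop the three-comma lines instead, change -eq to -ne.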
This simple command removes all the lines that contain the character 3 (note that it does not count commas):
$ awk '!/3/' file_name

Output a file in two columns in BASH

I'd like to rearrange a file in two columns after the nth line.
For example, say I have a file like this here:
This is a bunch
of text
that I'd like to print
as two
columns starting
at line number 7
and separated by four spaces.
Here are some
more lines so I can
demonstrate
what I'm talking about.
And I'd like to print it out like this:
This is a bunch           and separated by four spaces.
of text                   Here are some
that I'd like to print    more lines so I can
as two                    demonstrate
columns starting          what I'm talking about.
at line number 7
How could I do that with a bash command or function?
Actually, pr can do almost exactly this:
pr --output-tabs=' 1' -2 -t tmp1
↓
This is a bunch                     and separated by four spaces.
of text                             Here are some
that I'd like to print              more lines so I can
as two                              demonstrate
columns starting                    what I'm talking about.
at line number 7
-2 for two columns; -t to omit page headers; and without the --output-tabs=' 1', it'll insert a tab for every 8 spaces it added. You can also set the page width and length (if your actual files are much longer than 100 lines); check out man pr for some options.
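To get a feel for how pr distributes the lines, here is a tiny sketch. The wrapper name two_up is made up, and the behavior described is what I would expect from GNU pr; other implementations may differ.

```shell
# Hypothetical wrapper: two_up [WIDTH]
# Reads stdin and prints it in two columns without page headers.
two_up() {
    pr -2 -t -w"${1:-72}"
}

# Four input lines: a and b should land in the left column, c and d in the right.
printf '%s\n' a b c d | two_up 20
```

The page width is shared by both columns, so a narrow width truncates long lines rather than wrapping them.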
If you're fixed upon “four spaces more than the longest line on the left,” then you might have to use something a bit more complex.
The following works with your test input, but it is getting to the point where the correct answer would be, “just use Perl, already.”
#!/bin/sh
infile=${1:-tmp1}
longest=$(longest=0;
head -n $(( $( wc -l $infile | cut -d ' ' -f 1 ) / 2 )) $infile | \
while read line
do
current="$( echo $line | wc -c | cut -d ' ' -f 1 )"
if [ $current -gt $longest ]
then
echo $current
longest=$current
fi
done | tail -n 1 )
pr -t -2 -w$(( $longest * 2 + 6 )) --output-tabs=' 1' $infile
↓
This is a bunch           and separated by four spa
of text                   Here are some
that I'd like to print    more lines so I can
as two                    demonstrate
columns starting          what I'm talking about.
at line number 7
… re-reading your question, I wonder if you meant that you were going to literally specify the nth line to the program, in which case, neither of the above will work unless that line happens to be halfway down.
Thank you chatraed and BRPocock (and your colleague). Your answers helped me think up this solution, which answers my need.
function make_cols
{
file=$1 # input file
line=$2 # line to break at
pad=$(($3-1)) # spaces between cols - 1
len=$( wc -l < $file )
max=$(( $( wc -L < <(head -$(( line - 1 )) $file ) ) + $pad ))
SAVEIFS=$IFS;IFS=$(echo -en "\n\b")
paste -d" " <( for l in $( cat <(head -$(( line - 1 )) $file ) )
do
printf "%-""$max""s\n" $l
done ) \
<(tail -$(( len - line + 1 )) $file )
IFS=$SAVEIFS
}
make_cols tmp1 7 4
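For comparison, the same split-at-a-line idea can be sketched in a single awk call, which avoids the IFS juggling. This is only an alternative under the same assumptions (the function name two_cols and its argument order are made up): lines before the break line go on the left, padded to the longest left line plus the gap, and the rest go on the right.

```shell
# Hypothetical function: two_cols FILE BREAK_LINE GAP
# Lines 1..BREAK_LINE-1 form the left column, the rest the right column.
two_cols() {
    awk -v brk="$2" -v gap="$3" '
    NR < brk { left[++nl] = $0; if (length($0) > w) w = length($0); next }
             { right[++nr] = $0 }
    END {
        n = (nl > nr) ? nl : nr
        fmt = "%-" (w + gap) "s%s\n"      # pad the left column to width w + gap
        for (i = 1; i <= n; i++) {
            if (i <= nr)
                printf fmt, left[i], right[i]
            else
                print left[i]             # no right-hand line: avoid trailing spaces
        }
    }' "$1"
}

# Usage, with the file from the question: two_cols tmp1 7 4
```

It keeps the whole file in memory, which should be fine for text of this size.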
Could be optimized in many ways, but does its job as requested.
Input data (configurable):
file
num of rows borrowed from file for the first column
num of spaces between columns
format.sh:
#!/bin/bash
file=$1
if [[ ! -f $file ]]; then
echo "File not found!"
exit 1
fi
spaces_col1_col2=4
rows_col1=6
rows_col2=$(($(cat $file | wc -l) - $rows_col1))
IFS=$'\n'
ar1=($(head -$rows_col1 $file))
ar2=($(tail -$rows_col2 $file))
maxlen_col1=0
for i in "${ar1[@]}"; do
if [[ $maxlen_col1 -lt ${#i} ]]; then
maxlen_col1=${#i}
fi
done
maxlen_col1=$(($maxlen_col1+$spaces_col1_col2))
if [[ $rows_col1 -lt $rows_col2 ]]; then
rows=$rows_col2
else
rows=$rows_col1
fi
ar=()
for i in $(seq 0 $(($rows-1))); do
line=$(printf "%-${maxlen_col1}s\n" ${ar1[$i]})
line="$line${ar2[$i]}"
ar+=("$line")
done
printf '%s\n' "${ar[@]}"
Output:
$ > bash format.sh myfile
This is a bunch           and separated by four spaces.
of text                   Here are some
that I'd like to print    more lines so I can
as two                    demonstrate
columns starting          what I'm talking about.
at line number 7
$ >

How to delete all text on a line appearing after a particular symbol?

I have a file, file1.txt, like this:
This is some text.
This is some more text. ② This is a note.
This is yet some more text.
I need to delete any text appearing after "②", including the "②" and any single space appearing immediately before, if such a space is present. E.g., the above file would become file2.txt:
This is some text.
This is some more text.
This is yet some more text.
How can I delete the "②", anything coming after, and any preceding single space?
The solutions at How can I remove all text after a character in bash? do not seem to work, perhaps because "②" is not an ordinary character.
The file is saved in UTF-8.
A Perl solution:
$ perl -CS -i~ -p -E's/ ?②.*//' file1.txt
You'll end up with the correct data in file1.txt and a backup of the original file in file1.txt~.
Be aware that many Unix utilities are not fully Unicode-aware. I assume your input is in UTF-8; if not, you will have to adjust accordingly.
#!/bin/bash
function px {
local a="$@"
local i=0
while [ $i -lt ${#a} ]
do
printf \\x${a:$i:2}
i=$(($i+2))
done
}
(iconv -f UTF8 -t UTF16 | od -x | cut -b 9- | xargs -n 1) |
if read utf16header
then
echo -e $utf16header
out=''
while read line
do
if [ "$line" == "000a" ]
then
out="$out $line"
echo -e $out
out=''
else
out="$out $line"
fi
done
if [ "$out" != '' ] ; then
echo -e $out
fi
fi |
(perl -pe 's/( 0020)* 2461 .*$/ 000a/;s/ *//g') |
while read line
do
px $line
done | (iconv -f UTF16 -t UTF8 )
sed -e "s/[[:space:]]②[^\.]*\.//"
However, I am not sure that the ② symbol is parsed correctly. You may have to use its UTF-8 byte codes or something similar.
Try this:
sed -e '/②/ s/[ ]*②.*$//'
/②/ restricts the substitution to lines containing the magic symbol;
[ ]* matches any number of spaces (including none) before the magic symbol;
.*$ matches everything else up to the end of the line.
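The question asks for at most one preceding space to be removed, so a stricter variant might look like this. It is a sketch: the function name strip_note is made up, and it assumes the script and the data are both UTF-8 so that the ② in the pattern matches the same bytes as in the file.

```shell
# Hypothetical helper: strip_note [FILE...]
# Removes "②" and everything after it, plus at most one space right before it.
strip_note() {
    sed 's/ \{0,1\}②.*$//' "$@"
}

printf '%s\n' 'This is some text.' 'This is some more text. ② This is a note.' | strip_note
```

To produce file2.txt from the question, something like strip_note file1.txt > file2.txt should do.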
