CSV find blank value in third column KSH - shell

Hi, my data set is simple, as shown below:
4,a,1.5
t,6,,
6,t,h
I am trying to use awk or grep to count the rows in which there is a blank in the third column. In this case it would be 1, since only the middle row has a blank in that column. What I have tried so far is below. The logic is to use awk to search for a blank string and then count it, and the same with grep: find where there is a blank in the third column, then count the matches.
COUNT=$('awk '' $DATAFILE | wc -l')
COUNT=$('grep -e '.*,.*,,' $DATAFILE' | wc -l)

awk -F, '$3==""{c++} END{print c+0}' file

Your grep has too many quotes:
count=$(grep -E ".*,.*,," $DATAFILE | wc -l)
would work to a degree, but you do not want to match a line just because it has an empty fourth field.
Better would be
count=$(grep -E "^[^,]*,[^,]*,," $DATAFILE | wc -l)
This will still give problems with input like
field1,"field 2 with , insides quotes",,
Your question says nothing about this situation: what do you consider the third field to be here? That would be another question.
Edit:
@Sundeep correctly commented that you could use grep -c, avoiding wc -l. I tried to show what was wrong in the OP's attempt, but I should have added the advice to use -c.
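For completeness, a sketch of that -c variant (same pattern as above, just letting grep do the counting):
count=$(grep -Ec "^[^,]*,[^,]*,," "$DATAFILE")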

Related

Shell: Counting lines per column while ignoring empty ones

I am trying to simply count the lines in the .CSV per column, while at the same time ignoring empty lines.
I use below and it works for the 1st column:
cat /path/test.csv | cut -d, -f1 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 8
And below for the 2nd column:
cat /path/test.csv | cut -d, -f2 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 6
But when I try to count the 3rd column, it simply outputs the total number of lines in the whole .CSV.
cat /path/test.csv | cut -d, -f3 | grep . | wc -l >> ~/Desktop/Output.csv
#Outputs: 33
#Should be: 19?
I've also tried to use awk instead of cut, but get the same issue.
I have tried creating a new file, thinking maybe it had some spaces in the lines; still the same.
Can someone clarify the difference between reading columns 1-2 and the rest?
20355570_01.tif,,
20355570_02.tif,,
21377804_01.tif,,
21377804_02.tif,,
21404518_01.tif,,
21404518_02.tif,,
21404521_01.tif,,
21404521_02.tif,,
,22043764_01.tif,
,22043764_02.tif,
,22095060_01.tif,
,22095060_02.tif,
,23507574_01.tif,
,23507574_02.tif,
,,23507574_03.tif
,,23507804_01.tif
,,23507804_02.tif
,,23507804_03.tif
,,23509247_01.tif
,,23509247_02.tif
,,23509247_03.tif
,,23527663_01.tif
,,23527663_02.tif
,,23527663_03.tif
,,23527908_01.tif
,,23527908_02.tif
,,23527908_03.tif
,,23535506_01.tif
,,23535506_02.tif
,,23535562_01.tif
,,23535562_02.tif
,,23535636_01.tif
,,23535636_02.tif
That happens when the input file has DOS line endings (\r\n). Fix your file using dos2unix and your command will work for the 3rd column too.
dos2unix /path/test.csv
Or, you can remove the \r at the end while counting non-empty columns using awk:
awk -F, '{sub(/\r/,"")} $3!=""{n++} END{print n}' /path/test.csv
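A quick way to confirm the line endings before converting (a sketch; -A is the GNU coreutils spelling, cat -e behaves similarly on BSD):
cat -A /path/test.csv | head -n 3
Lines that end in ^M$ have DOS (\r\n) endings.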
The problem is in the grep command: the way you wrote it will return 33 lines when you count the 3rd column.
It's better instead to use the following command to count number of lines in .CSV for each column (example below is for the 3rd column):
cat /path/test.csv | cut -d , -f3 | grep -cve '^\s*$'
This will return the exact number of non-empty lines for each column and avoids piping into wc.
See previous post here:
count (non-blank) lines-of-code in bash
edit: I think oguz ismail found the actual reason in their answer. If they are right and your file has windows line endings you can use one of the following commands without having to convert the file.
cut -d, -f3 yourFile.csv | tr -d \\r | grep -c .
cut -d, -f3 yourFile.csv | grep -c $'[^\r]' # bash only
old answer: Since I cannot reproduce your problem with the provided input I take a wild guess:
The "empty" fields in the last column contain spaces. A field containing a space is not empty altough it looks like it is empty as you cannot see spaces.
To count only fields that contain something other than a space adapt your regex from . (any symbol) to [^ ] (any symbol other than space).
cut -d, -f3 yourFile.csv | grep -c '[^ ]'

BASH script help using TOP, GREP and CUT

Use the top command, repeated 5 times, and pipe the results to the grep and cut commands to print the PID of the init process on your screen.
Hi all, I have my line of code:
top -n 5 | grep "init" | cut -d" " -f3 > topdata
But I cannot see any output to verify that it's working.
Also, the next script asks me to use a one line command which shows the total memory used in megabytes. I'm supposed to pipe results from Free to Grep to select or filter the lines with the pattern "Total:" then pipe that result to Cut and display the number representing total memory used. So far:
free -m -t | grep "total:" | cut -c25-30
Also not getting any print return on that one. Any help appreciated.
expanding on my comments:
grep is case sensitive. free says "Total", you grep "total". So no match! Either grep for "Total" or use grep -i.
Instead of cut, I prefer awk when I need to get a number out of a line. You do not know what length the number will be, but you know it will be the first number after Total:. So:
free -m -t | grep "Total:" | awk '{print $2}'
For your top command, if you have no init process (which you should, but it would probably not show in top), just grep for something else to see if your code works. I used cinnamon (running Mint). The top command is:
top -n 5 | grep "cinnamon" | awk '{print $1}'
Replace "cinnamon" by "init" for your requirement. Why $1 in the awk? My top puts the PID in the first column. Adjust accordingly.
Overall, using cut is good when you have a string that is delimited by some character. Ex. aaa;bbb;ccc, you would cut on -d';'. But here the numbers might have different lengths so using cut is not (IMHO) the best solution.
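For example, for the delimited case mentioned above:
echo 'aaa;bbb;ccc' | cut -d';' -f2   # prints bbb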
The init process has PID 1, so there's no reason to do it like this.
To find the PID of a process in general, I'd recommend:
pidof <name>

Shell cut delimiter before last

I'm trying to cut a string (the name of a file) from which I have to get a variable that is embedded in the name.
The problem is that I have to put it in a shell variable; up to that point it is OK.
Here is an example of what I have to do.
NAME_OF_THE_FILE_VARIABLEiWANTtoGET_DATE
NAMEfile_VARIABLEiWANT_DATE
NAME_FILE_VARIABLEiWANT_DATE
The position of the variable I want can change, but it will always be the one before the last. The delimiter is "_".
Is there a way to count the size of the array to get size-1 or something like that?
Note: when I cut strings I usually use something like this:
VARIABLEiWANT=`echo "$FILENAME" | cut -f 1 -d "_"`
awk -F'_' '{print $(NF-1)}' file
or you have a string
awk -F'_' '{print $(NF-1)}' <<< "$FILENAME"
Save the output of the above one-liner into your variable.
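A sketch of that, using one of the example names from the question:
FILENAME=NAME_OF_THE_FILE_VARIABLEiWANT_DATE
VARIABLEiWANT=$(awk -F'_' '{print $(NF-1)}' <<< "$FILENAME")
echo "$VARIABLEiWANT"   # prints VARIABLEiWANT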
IFS=_ read -a array <<< "$FILENAME"
variable_i_want=${array[${#array[@]}-2]}
It's a bit of a mess visually, but it's more efficient than starting a new process. ${#array[@]} is the number of elements read from FILENAME, so the indices for the array range from 0 to ${#array[@]}-1.
As of bash 4.3, though, you can use a negative index instead of computing it.
variable_i_want=${array[-2]}
If you need POSIX compatibility (no arrays), then
tmp=${FILENAME%_${FILENAME##*_}} # FILENAME with last field removed
variable_i_want=${tmp##*_} # last field of tmp
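A quick trace with one of the example names from the question:
FILENAME=NAMEfile_VARIABLEiWANT_DATE
tmp=${FILENAME%_${FILENAME##*_}}   # NAMEfile_VARIABLEiWANT
variable_i_want=${tmp##*_}         # VARIABLEiWANT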
Just got it... I found someone using a cat function... I got it to work with echo... and rev. I didn't understand the rev part, but I think it reverses each line, so the field before the last becomes the second field.
CODIGO=`echo "$ARQ_NAME" | rev | cut -d "_" -f 2 | rev `

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string="line of input from CSV file"
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
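To illustrate that caveat with a made-up line containing a quoted comma:
echo 'a,"b,c",d' | awk -F "," '{print NF-1}'   # prints 3, although the record has only 3 logical CSV fields (2 separating commas)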
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
echo "$((${#array[#]} - 1))"
done < inputfile
or
while read -r line
do
count=${line//[^,]}
echo "${#count}"
done < inputfile
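Either loop, run against the test.txt example above (foo,bar,baz and baz,foo,foobar,bar), prints
2
3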
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since it's going to be installed on most modern shells) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
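Run against the test.txt example above (assuming it stands in for my_file.csv), this would print {2, 3}.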
Just remove all of the carriage returns:
tr -d "\r" old_file > new_file

Unexpected variable update when using bash's $(( )) operator for arithmetic

I'm trying to trim a few lines from a file. I know exactly how many lines to remove (say, 2 from the top), but not how many total lines are in the file. So I tried this straightforward solution:
$ wc -l $FILENAME
119559 my_filename.txt
$ LINES=$(wc -l $FILENAME | awk '{print $1}')
$ tail -n $(($LINES - 2)) $FILENAME > $OUTPUT_FILE
The output is fine, but what happened to LINES??
$ wc -l $OUTPUT_FILE
119557 my_output_file.txt
$ echo $LINES
107
Hoping someone can help me understand what's going on.
$LINES has a special meaning. It is the number of rows the terminal has, and if you resize your terminal window, it will be re-set. See info "(bash)Bash Variables".
It always helps to decompose the pipeline where you think the problem is. Running
wc -l $FILENAME | awk '{print $1}'
should probably show you where the problem is.
Instead, use
LINES=$(wc -l < $FILENAME )
Hm... Yes, I'm afraid @MichaelHoffman has probably diagnosed your problem more accurately.
I hope this helps.
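Putting the two points together, a sketch that avoids both the extra awk step and the special LINES name (num_lines is just an arbitrary lowercase name):
num_lines=$(wc -l < "$FILENAME")
tail -n "$((num_lines - 2))" "$FILENAME" > "$OUTPUT_FILE"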
You could also just do sed 'X,Yd' < file
Where X,Y is the range of the lines you want to omit (in this case it would be 1,2).
Other alternatives are:
sed 'X,+Yd' omits Y lines starting from line X
sed '/regex/,Yd' omits everything between the line where the regex matches and line Y
sed '/regex/,+Yd' omits Y lines starting from where the regex matches
sed '/regex/,/regex/d' omits everything between the two regexs
Note: these are GNU sed extensions
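For the original case (dropping the first 2 lines), the plain range form is enough and is portable sed:
sed '1,2d' "$FILENAME" > "$OUTPUT_FILE"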
