Bash output (trimming string) - bash

Can someone explain to me what exactly I am trimming here? I mean everything: what the \n, the head, the cut, the -d etc. mean.
classnumber=$(cat "ClassTimetable.aspx?CourseId=156784&TermCode=1620" | tr '\n' '\r' | head -n 1 | cut -d '>' -f1235- | cut -d '<' -f1)
Thanks

cat "ClassTimetable.aspx?CourseId=156784&TermCode=1620" | \
tr '\n' '\r' | # replace Line Feed with Carriage Return
head -n 1 | # take what's in the first line
cut -d '>' -f1235- | # take fields 1235 to end of line, i.e. everything after the 1234th '>'
cut -d '<' -f1 # take the first chunk before the first '<'
Experiment with it to understand what it does; for example, try reducing it to
echo "1>2>3>4>5>6>7>8>9>10>11>12>13><14>15" | cut -d '>' -f12- | cut -d '<' -f1
and come up with an explanation of how it works.
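For reference, in that reduced pipeline the first cut keeps fields 12 through the end, rejoined with '>':
echo "1>2>3>4>5>6>7>8>9>10>11>12>13><14>15" | cut -d '>' -f12-
12>13><14>15
and the second cut then keeps everything before the first '<', so the full pipeline prints:
12>13>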

You are trimming, stage by stage:
1. Every \n, replaced with a \r
2. Everything but the first line
3. Everything but fields 1235 through the end of the line ('>' is the delimiter)
4. From the result of step 3, everything but the first field ('<' is the delimiter)
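One way to see why the tr and head steps work as a pair: once every \n has become a \r, the whole input is a single "line", so head -n 1 keeps all of it (cat -v renders the carriage returns as ^M):
printf 'one\ntwo\nthree\n' | tr '\n' '\r' | head -n 1 | cat -v
one^Mtwo^Mthree^M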

Related

Extract words within curly quotes but keep the quote when used as an apostrophe

I have a UTF-8 file which contains curly quotes around words, like ‘Awaara’, and in other places a curly quote is used as an apostrophe, as in don’t. The issue arises when trying to convert these curly quotes to single quotes: after converting, I am unable to strip the quoting single quotes from words like 'Awaara' without also removing the apostrophes in don't and I'm.
GOAL: convert curly quotes to straight single quotes, then remove the quoting single quotes while keeping those used as apostrophes.
Here's the code I have written; it converts the quotes but fails to remove them from quoted words:
#!/bin/bash
cat "$1" | sed -e "s/\’/'/g" -e "s/\‘/'/g" | \
  sed -e "s/^'/ /g" -e "s/'$/ /g" | sed "s/\…/ /g" | \
  tr '>' ' ' | tr '?' ' ' | tr ',' ' ' | tr ';' ' ' | tr '.' ' ' | \
  tr '!' ' ' | tr '′' ' ' | tr ':' ' ' | \
  sed -e "s/\[/ /g" -e "s/\]/ /g" -e 's/(/ /g' -e "s/)/ /g" | \
  tr ' ' '\n' | sort -u | uniq | tr 'a-z' 'A-Z' > our_vocab.txt
The output is:
'AWAARA ---> Should be AWAARA
25
50
70
800
A
AD
AI
AMITABH
AND
ANYWAY
ARE
BACHCHAN
BECAUSE
BUT
C++
CAN
CHECK
COMPUTER
DEVAKI
DIFFICULT
.
.
.
HOON' --> Should be HOON
You can use
sed -E -e "s/([[:alpha:]]['’][[:alpha:]])|['‘’]/\\1/g" \
-e 's/[][()>?,;.!:]|′|…/ /g' "$1" | tr ' ' '\n' | sort -u | \
tr 'a-z' 'A-Z' > our_vocab.txt
I merged the several tr commands into a single (second) sed command, and the ([[:alpha:]]['’][[:alpha:]])|['‘’] regex removes all ', ‘ and ’ characters except an apostrophe between two letters: the first alternative captures letter-quote-letter and puts it back via \1, while a bare quote matches the second alternative and is replaced with nothing.
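For instance, on a single sample line (assuming a UTF-8 locale):
echo "‘Awaara’ don’t HOON’" | sed -E "s/([[:alpha:]]['’][[:alpha:]])|['‘’]/\\1/g"
Awaara don’t HOON
The apostrophe in don’t sits between letters and survives; the quoting marks around Awaara and the trailing one on HOON’ do not.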

exiting an IF statement after initial match - bash scripting

I have a script which iterates through a file and finds matches in another file. How do I get the process to stop once I've found a match?
For example:
I take the first line in name.txt, and then try to find a match for it in file.txt.
name.txt:
7,7,FRESH,98,135,
65,10,OLD,56,45,
file.txt:
7,7,Dave,S
8,10,Frank,S
31,7,Gregg
45,5,Jake,S
Script:
while read line
do
  name_id=`echo "$line" | cut -f1,2 -d ','`
  identifierOne=`echo "$name_id" | cut -f1 -d ','`
  identifierTwo=`echo "$name_id" | cut -f2 -d ','`
  while IFS= read line
  do
    CHECK=`echo "$line" | cut -f4 -d ','`
    if [ "$CHECK" = "S" ]
    then
      symbolName=`echo "$line" | cut -f3 -d ','`
      numberOne=`echo "$line" | awk -F',' '{print $1}'`
      numberTwo=`echo "$line" | cut -f2 -d ','`
      if [ "$numberOne" = "$identifierOne" ] && [ "$numberTwo" = "$identifierTwo" ]
      then
        echo "WE HAVE A MATCH with $symbolName"
        break
      fi
    fi
  done < /tmp/file.txt
done < /tmp/name.txt
My question is: how do I stop the script from iterating through file.txt once it has found an initial match, set that matched record into a variable, leave the if statement, and then do some other stuff within the loop using that variable? I tried using break, but that exits the loop, which is not what I want.
You can tell grep different things:
Stop searching after the first match (option -m 1).
Read the search keys from a file (option -f file).
Pretend that the output of a command is a file (not really grep; bash helps here) with <(command).
Combining these will give you
grep -m1 -f <(cut -d"," -f1-2 name.txt) file.txt
Close, but not quite what you want: the substrings produced by cut -d"," -f1-2 name.txt will match anywhere in the line, and you want them to match the first two fields only. Matching at the start of the line is done with ^, so we use sed to turn each line into a pattern like ^field1,field2,:
grep -m1 -f <(sed 's/\([^,]*,[^,]*,\).*/^\1/' name.txt) file.txt
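With the sample name.txt and file.txt above, the sed stage turns the first two fields of each line into an anchored pattern:
sed 's/\([^,]*,[^,]*,\).*/^\1/' name.txt
^7,7,
^65,10,
and the full command prints the first line of file.txt that matches any of them:
7,7,Dave,S
(-m1 stops grep after the first matching line overall; drop it to print every line that matches any pattern.)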

Extract values from a comma delimited file when the delimiter appears in the data itself

Hi, I have a requirement where I want to extract values from a comma delimited file. The problem arises when the delimiter itself appears as a data value. All values come wrapped in a pair of single quotes, and if a value is missing, the field is blank.
Example:
cat file1.dat
'Data1','DataA',,',',,'2','0','0'
'Data2','DataB','X','D','3','1','2'
In the script I am doing the following
while read line
do
  F1=`echo "$line" | cut -d"," -f1`
  F2=`echo "$line" | cut -d"," -f2`
  F3=`echo "$line" | cut -d"," -f3`
  F4=`echo "$line" | cut -d"," -f4`
  echo "$F1"
  echo "$F2"
  echo "$F3"
  echo "$F4"
done < file1.dat
Present output:
'Data1'
'DataA'
'
'Data2'
'DataB'
'X'
'D'
Desired Output:
'Data1'
'DataA'
','
'Data2'
'DataB'
'X'
'D'
The following solution assumes that you have a character that doesn't appear in the input. Say the character | doesn't appear in file1.dat; then the following yields the desired result:
$ sed "s/,',/,'|/" file1.dat | cut -d, -f1-4 --output-delimiter=$'\n' | tr '|' ','
'Data1'
'DataA'
','
'Data2'
'DataB'
'X'
'D'
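To see the trick in isolation, run just the sed stage. Only the first ,', on each line is rewritten, which is enough here because only fields 1-4 are cut:
sed "s/,',/,'|/" file1.dat
'Data1','DataA',,'|',,'2','0','0'
'Data2','DataB','X','D','3','1','2'
The | placeholder protects the quoted comma while cut splits the line, and the final tr turns it back into a comma. (Note that --output-delimiter is a GNU cut extension.)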

shell cut command to remove characters

The current code below (the grep & cut) is outputting 51315123&category_id; I need to remove &category_id. Can cut be used to do that?
... | tr '?' '\n' | grep "^recipient_dvd_id=" | cut -d '=' -f 2 >> dvdIDs.txt
Yes, I would think so:
... | cut -d '&' -f 1
If you're open to using AWK:
... | tr '?' '\n' | awk -F'[=&]' '/^recipient_dvd_id=/ {print $2}' >> dvdIDs.txt
AWK handles the regex and the field splitting, in this case using both '=' and '&' as field separators and printing the second column. Otherwise, you would need a second cut command.
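For example, on a hypothetical query string (the 51315123 value comes from the question; the category_id=7 suffix is made up), both pipelines print the bare ID:
echo 'recipient_dvd_id=51315123&category_id=7' | cut -d '=' -f2 | cut -d '&' -f1
51315123
echo 'recipient_dvd_id=51315123&category_id=7' | awk -F'[=&]' '/^recipient_dvd_id=/ {print $2}'
51315123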

sort | uniq | xargs grep ... where lines contain spaces

I have a comma delimited file "myfile.csv" where the 5th column is a date/time stamp (mm/dd/yyyy hh:mm). I need to list all the rows that contain duplicate dates (there are lots).
I'm using a bash shell via cygwin for WinXP
$ cut -d, -f 5 myfile.csv | sort | uniq -d
correctly returns a list of the duplicate dates
01/01/2005 00:22
01/01/2005 00:37
[snip]
02/29/2009 23:54
But I cannot figure out how to feed this to grep to give me all the rows.
Obviously, I can't use xargs straight up since the output contains spaces. I thought I could do uniq -z -d but for some reason, combining those flags causes uniq to (apparently) return nothing.
So, given that
$ cut -d, -f 5 myfile.csv | sort | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
doesn't work... what can I do?
I know that I could do this in perl or another scripting language... but my stubborn nature insists that I should be able to do it in bash using standard commandline tools like sort, uniq, find, grep, cut, etc.
Teach me, oh bash gurus. How can I get the list of rows I need using typical cli tools?
sort -k5,5 will do the sort on fields and avoid the cut;
uniq -f 4 will ignore the first 4 fields for the uniq;
Plus a -D on the uniq will get you all of the repeated lines (vs -d, which gets you just one);
but uniq expects blank-delimited fields instead of CSV, so tr ',' '\t' to fix that.
Problem is if you have fields after #5 that are different. Are your dates all the same length? You might be able to add a -w 16 (to include time), or -w 10 (for just dates), to the uniq.
So:
tr ',' '\t' < myfile.csv | sort -k5,5 | uniq -f 4 -D -w 16
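For example, with two hypothetical rows that share a timestamp:
printf '%s\n' 'a,b,c,d,01/01/2005 00:22,x' 'e,f,g,h,01/01/2005 00:22,y' | tr ',' '\t' | sort -k5,5 | uniq -f 4 -D -w 16
prints both rows (now tab-delimited), while a row with a unique timestamp would be suppressed. Bear in mind that -w 16 compares the separating tab plus the first 15 characters of the timestamp, so you may need to adjust the width for your data, and that -D and -w are GNU uniq extensions.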
The -z option of uniq needs the input to be NUL separated. You can filter the output of cut through:
tr '\n' '\000'
to get NUL-separated rows. Then sort, uniq and xargs have options to handle that. Try something like:
cut -d, -f 5 myfile.csv | tr '\n' '\000' | sort -z | uniq -d -z | xargs -0 -I {} grep '{}' myfile.csv
You can tell xargs to use each line as an argument in its entirety using the -d option. Try:
cut -d, -f 5 myfile.csv | sort | uniq -d | xargs -d '\n' -I '{}' grep '{}' myfile.csv
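A quick check that the embedded spaces survive intact (-d is a GNU xargs option, not POSIX):
printf '%s\n' 'a b' 'c d' | xargs -d '\n' -I '{}' echo "[{}]"
[a b]
[c d]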
This is a good candidate for awk:
BEGIN { FS="," }
{ split($5, A, " "); date[A[1]] = date[A[1]] " " NR }
END { for (i in date) print i ":" date[i] }
Set the field separator to ',' (CSV).
Split the fifth field on the space and put the result in A (awk arrays from split are 1-based, so A[1] is the date).
Concatenate the line number to the list of what we have already stored for that date.
Print out the line numbers for each date.
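If that script is saved as, say, dupdates.awk (a made-up name), you would run it as:
awk -f dupdates.awk myfile.csv
and get one line per date listing the line numbers where it occurs, e.g. 01/01/2005: 1 3 7. Note that it lists every date, including those that occur only once; the duplicates are the entries with more than one line number.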
Try escaping the spaces with sed:
echo 01/01/2005 00:37 | sed 's/ /\\ /g'
cut -d, -f 5 myfile.csv | sort | uniq -d | sed 's/ /\\ /g' | xargs -I '{}' grep '{}' myfile.csv
(Yet another way would be to read the duplicate date lines into an IFS=$'\n' array and iterate over it in a for loop.)
