Cut one word before delimiter - Bash

How do I use cut to get one word before the delimiter? For example, I have the line below in file.txt:
one two three four five: six seven
when I use the cut command below:
cat file.txt | cut -d ':' -f1
...then I get everything before the delimiter; i.e.:
one two three four five
...but I only want to get "five"
I do not want to use awk or the position, because the file changes all the time and the position of "five" can be anywhere. The only thing fixed is that five will have a ":" delimiter.
Thanks!

Pure bash:
s='one two three four five: six seven'
w=${s%%:*} # cut off everything from the first colon
l=${w##* } # cut off everything until last space
echo $l
# => five
(If you have one colon in your file, s=$(grep : file) should set up your initial variable)
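Putting that together against the file (a sketch, assuming exactly one line of file.txt contains a colon):
s=$(grep : file.txt)   # grab the line containing the colon
w=${s%%:*}             # cut off everything from the first colon
echo "${w##* }"        # cut off everything until the last space
# => five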

Since you need more than one field delimiter here, awk comes to the rescue:
s='one two three four five: six seven'
awk -F '[: ]' '{print $5}' <<< "$s"
five
EDIT: If your field positions can change then try this awk:
awk -F: '{sub(/.*[[:blank:]]/, "", $1); print $1}' <<< "$s"
five
Here is a BASH one-liner to get this in a single command:
[[ $s =~ ([^: ]+): ]] && echo "${BASH_REMATCH[1]}"
five
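The same match can be run directly against the file (a sketch; $(< file.txt) slurps the whole file, so the regex finds the first "word:" anywhere in it):
[[ $(< file.txt) =~ ([^: ]+): ]] && echo "${BASH_REMATCH[1]}"
# => five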

You may want to do something like this:
cat file.txt | while read line
do
  for word in $line
  do
    if echo "$word" | grep -q ':$'; then
      echo "${word%:}"   # strip the trailing colon => five
    fi
  done
done
If the structure is consistent (even with a varying number of words per line), you can change the first line to:
cat file.txt | cut -d':' -f1 | while read line
do ...
That way you avoid processing anything to the right of the ':' delimiter.
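A complete version of that variant might look like this (a sketch; it keeps only the last word of the part before the colon):
cut -d':' -f1 file.txt | while read line
do
  for word in $line; do last=$word; done   # remember the last word on the line
  echo "$last"
done
# => five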

Try
echo "one two three four five: six seven" | awk -F ':' '{print $1}' | awk '{print $NF}'
This will always print the last word before the first ':', no matter where it appears.

Related

Trim line to the first comma (bash)

I have a line from which I need to cut the branch name to the first comma:
commit 2bea9e0351dae65f18d2de11621049b465b1e868 (HEAD, origin/MGB-322, refs/pipelines/36877)
I need to cut out MGB-322.
The number of characters in a line is always different.
awk -F "origin/" '{print $2}' - this is how I cut out
MGB-322, refs/pipelines/36877)
But how to tell it to trim to the first comma?
I tried doing it via substr,
awk -F "origin/" '{print substr ($2,1, index $2 ,)}'
But it is not clear how to correctly specify the comma in index
With any awk. Use / and , as field separator:
awk '{print $3}' FS='[/,]' file
Output:
MGB-322
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
A fix of the OP's code: this assumes "origin" occurs only once; if it can occur more than once, change $NF to $2 in the following. Written and tested at https://ideone.com/xjv2we
awk -F"origin/" '{print $NF}' Input_file
sed could also be helpful here. This generic solution is based on the first occurrence of a comma and of /, as per the OP's title. (Written on mobile, so it is untested for now, but it should work; I will test it after some time.)
sed 's/\([^,]*\),\([^/]*\)\/\(.*\)/\3/' Input_file
"I need to cut out MGB-322."
You can use cut in two steps:
echo "${line}" | cut -d"/" -f2 | cut -d"," -f1
I would prefer one step with awk (already answered by others) or sed:
echo "${line}" | sed -r 's/.*origin.(.*), refs.*/\1/'
Why spawn procs? bash's built-in parameter parsing will handle this.
If
$: line="commit 2bea9e0351dae65f18d2de11621049b465b1e868 (HEAD, origin/MGB-322, refs/pipelines/36877)"
then
$: [[ "$line" =~ .*origin.(.*), ]] && echo "${BASH_REMATCH[1]}"
MGB-322
or maybe
$: tmp=${line#*, origin/}; echo ${tmp%,*}
MGB-322
or even
$: IFS=",/" read _ _ x _ <<< "$line" && echo $x
MGB-322
c.f. https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html

Bash Shell: Infinite Loop

The problem is the following: I have a file where each line has this form:
id|lastName|firstName|gender|birthday|joinDate|IP|browser
I want to sort all the first names in that file alphabetically and print them one per line, each name only once.
I have written the following program, but for some reason it creates an infinite loop:
array1=()
while read LINE
do
if [ ${LINE:0:1} != '#' ]
then
IFS="|"
array=($LINE)
if [[ "${array1[#]}" != "${array[2]}" ]]
then
array1+=("${array[2]}")
fi
fi
done < $3
echo ${array1[@]} | awk 'BEGIN{RS=" ";} {print $1}' | sort
NOTES
if [ ${LINE:0:1} != '#' ] : this test is used because there are comments in the file that I don't want to print
$3 : filename
array1 : is used to collect all the separate names
Wow, there's a MUCH simpler and cleaner way to achieve this, without having to mess with the IFS variable or using arrays. You can use "for" to do this:
First I created a file with the same structure as yours:
$ cat file
id|lastName|Douglas|gender|birthday|joinDate|IP|browser
id|lastName|Tim|gender|birthday|joinDate|IP|browser
id|lastName|Andrew|gender|birthday|joinDate|IP|browser
id|lastName|Sasha|gender|birthday|joinDate|IP|browser
#id|lastName|Carly|gender|birthday|joinDate|IP|browser
id|lastName|Madson|gender|birthday|joinDate|IP|browser
Here's the script I wrote using "for":
#!/bin/bash
for LINE in `cat file | grep -v "^#" | awk -F'|' '{print$3}' | sort -u`
do
echo $LINE
done
And here's the output of this script:
$ ./script.sh
Andrew
Douglas
Madson
Sasha
Tim
Explanation:
for LINE in `cat file`
Creates a loop over the output of `cat file`, word by word (each name here is a single word, so it effectively reads line by line). The commands between backticks are run by the shell; for example, to store the date in a variable you could use "VARDATE=`date`".
grep -v "^#"
The option -v is used to exclude results matching the pattern, in this case the pattern is "^#". The "^" character means "line begins with". So grep -v "^#" means "exclude lines beginning with #".
awk -F'|' '{print$3}'
The -F option switches the column delimiter from the default (whitespace) to whatever you put after it, in this case the "|" character.
The '{print$3}' prints the 3rd column.
sort -u
And the "sort -u" command to sort the names alphabetically.

How to process large csv files efficiently using a shell script, to get better performance than the following script?

I have a large csv file input_file with 5 columns. I want to do two things to the second column:
(1) Remove last character
(2) Append leading and trailing single quote
Following are the sample rows from input_file.dat
420374,2014-04-06T18:44:58.314Z,214537888,12462,1
420374,2014-04-06T18:44:58.325Z,214537850,10471,1
281626,2014-04-06T09:40:13.032Z,214535653,1883,1
Sample output would look like :
420374,'2014-04-06T18:44:58.314',214537888,12462,1
420374,'2014-04-06T18:44:58.325',214537850,10471,1
281626,'2014-04-06T09:40:13.032',214535653,1883,1
I have written the following code to do this.
#!/bin/sh
inputfilename=input_file.dat
outputfilename=output_file.dat
count=1
while read line
do
echo $count
count=$((count + 1))
v1=$(echo $line | cut -d ',' -f1)
v2=$(echo $line | cut -d ',' -f2)
v3=$(echo $line | cut -d ',' -f3)
v4=$(echo $line | cut -d ',' -f4)
v5=$(echo $line | cut -d ',' -f5)
v2len=${#v2}
v2len=$((v2len -1))
newv2=${v2:0:$v2len}
newv2="'$newv2'"
row=$v1,$newv2,$v3,$v4,$v5
echo $row >> $outputfilename
done < $inputfilename
But it's taking a lot of time.
Is there any efficient way to achieve this?
You can do this with awk
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat
How it works:
BEGIN{FS=OFS=","} : set input and output field separator (FS, OFS) to ,.
-v q="'" : assign a literal single quote to the variable q (to avoid complex escaping in the awk expression)
{$2=q substr($2,1,length($2)-1) q} : Replace the second field ($2) with a single quote (q) followed by the value of the 2nd field without the last character (substr(string, start, length)) and appending a literal single quote (q) at the end.
1 : Just invoke the default action, which is to print the current (edited) line.
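To write the result to a file, as the original script does, just redirect the output (output_file.dat is an assumed target name):
awk -v q="'" 'BEGIN{FS=OFS=","} {$2=q substr($2,1,length($2)-1) q}1' input_file.dat > output_file.dat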

Bash : How to check in a file if there are any word duplicates

I have a file with 6 character words in every line and I want to check if there are any duplicate words. I did the following but something isn't right:
#!/bin/bash
while read line
do
name=$line
d=$( grep '$name' chain.txt | wc -w )
if [ $d -gt '1' ]; then
echo $d $name
fi
done <$1
Assuming each word is on a new line, you can achieve this without looping:
$ cat chain.txt | sort | uniq -c | grep -v " 1 " | cut -c9-
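If you only need the duplicated words themselves, uniq's -d flag (print one copy of each repeated line) shortens this further:
$ sort chain.txt | uniq -d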
You can use awk for that:
awk -F'\n' 'found[$1] {print}; {found[$1]++}' chain.txt
Set the field separator to newline, so that we look at the whole line. Then, if the line already exists in the array found, print the line. Finally, add the line to the found array.
Note: a line is only suppressed once; so if the same line appears, say, 6 times, it will be printed 5 times.
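If you also want the duplicate count, as the OP's "echo $d $name" suggests, an END-block variant works (a sketch; the output order of for (w in count) is unspecified):
awk -F'\n' '{count[$1]++} END{for (w in count) if (count[w] > 1) print count[w], w}' chain.txt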

unix command to get lines between the first and last occurrence of a word and write to a file

I want a unix command to find the lines between the first & last occurrence of a word.
For example:
Let's imagine we have 1000 lines. The tenth line contains the word "stackoverflow", and the thirty-fifth line also contains the word "stackoverflow".
I want to print the lines between 10 and 35 and write them to a new file.
You can do it in two steps. The basic idea is to:
1) get the line numbers of the first and last match;
2) print the range of lines between them.
$ read first last <<< $(grep -n stackoverflow your_file | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file
Explanation
read first last reads two values and stores them in $first and $last.
grep -n stackoverflow your_file greps and shows matches like this: line_number:matching_line
awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}' prints the line numbers of the first and last match of stackoverflow in the file.
And
awk -v f=$first -v l=$last 'NR>=f && NR<=l' your_file prints all lines from $first line number till $last line number.
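Once $first and $last are set, sed can print the same range as the second awk (an equivalent alternative):
sed -n "${first},${last}p" your_file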
Test
$ cat a
here we
have some text
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
to make more fun
blablabla
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ awk -v f=$first -v l=$last 'NR>=f && NR<=l' a
stackoverflow
and other things
bla
bla
bla bla
stackoverflow
and whatever else
stackoverflow
Step by step:
$ grep -n stackoverflow a
3:stackoverflow
9:stackoverflow
11:stackoverflow
$ grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}'
3 11
$ read first last <<< $(grep -n stackoverflow a | awk -F: 'NR==1 {printf "%d ", $1}; END{print $1}')
$ echo "first=$first, last=$last"
first=3, last=11
If you know an upper bound of how many lines there can be (say, a million), then you can use this simple abusive script:
(grep -A 100000 stackoverflow | grep -B 1000000 stackoverflow) < file
You can append | tail -n +2 | head -n -1 to strip the border lines as well:
(grep -A 100000 stackoverflow | grep -B 1000000 stackoverflow |
  tail -n +2 | head -n -1) < file
I'm not 100% sure from the question whether the output should be inclusive of the first and last matching lines, so I'm assuming it is. But this can be easily changed if we want exclusive instead.
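Since the question also asks to write the result to a new file, redirecting is enough (out.txt is an assumed name):
(grep -A 100000 stackoverflow | grep -B 1000000 stackoverflow) < file > out.txt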
This pure-bash solution does it all in one step - i.e. the file (or pipe) is only read once:
#!/bin/bash
function midgrep {
  while IFS= read -r ln; do
    [ "$saveline" ] && linea[$((i++))]=$ln
    if [[ $ln =~ $1 ]]; then
      if [ "$saveline" ]; then
        # flush the buffer: everything since the previous match, including this line
        for ((j=0; j<i; j++)); do echo "${linea[$j]}"; done
        i=0
      else
        # first match: start buffering from here
        saveline=1
        linea[$((i++))]=$ln
      fi
    fi
  done
}
midgrep "$1"
Save this as a script (e.g. midgrep.sh) and pipe whatever output you like to it as follows:
$ cat input.txt | ./midgrep.sh stackoverflow
This works as follows:
find the first matching line and buffer it in the first element of an array
continue reading lines until the next match, buffering to the array as we go
on each subsequent match, flush the buffer array to output
continue reading the file to the end. If there are no more matches, the last buffer is simply discarded.
The advantage of this approach is that we read through the input only once. The disadvantage is that we buffer everything between matches: if there are many lines between matches, they are all held in memory until we hit the next match.
Also this uses the bash =~ regular expression operator to keep this pure bash. But you could replace this with a grep instead, if you are more comfortable with that.
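For example, the regex test could be swapped for a grep call like this (a sketch; note it spawns one grep process per line, so it is slower):
match() { printf '%s\n' "$2" | grep -qE -e "$1"; }   # same truth value as [[ $2 =~ $1 ]] for ERE patterns
match stackoverflow "some stackoverflow line" && echo yes
# inside midgrep, "if [[ $ln =~ $1 ]]" would become: if match "$1" "$ln"; then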
Using perl :
perl -00 -lne '
chomp(my @arr = split /stackoverflow/);
print join "\nstackoverflow", @arr[1 .. $#arr - 1]
' file.txt | tee newfile.txt
The idea is to slurp the input, split it into an array of chunks using the string "stackoverflow" as the separator, and then print chunks 1 through the second-to-last, rejoined with "\nstackoverflow".
