Remove everything in a pipe delimited file after second-to-last pipe - bash

How can I remove everything in a pipe-delimited file after the second-to-last pipe? For example, for the line
David|3456|ACCOUNT|MALFUNCTION|CANON|456
the result should be
David|3456|ACCOUNT|MALFUNCTION

Replace |(string without pipe)|(string without pipe) at the end of each line:
sed 's/|[^|]*|[^|]*$//' inputfile

Using awk, something like this (note that decrementing NF to drop fields works in GNU awk, but not in every awk implementation):
awk -F'|' 'BEGIN{OFS="|"}{NF=NF-2; print}' inputfile
David|3456|ACCOUNT|MALFUNCTION
Or use cut if you know the total number of columns, i.e. 6 -> 4:
cut -d'|' -f -4 inputfile
David|3456|ACCOUNT|MALFUNCTION

The command I would use is
sed -r 's/(.*)\|.*\|.*/\1/' input.txt > output.txt
The first .* is greedy, so it grabs everything up to the second-to-last pipe and the last two fields are dropped.
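A quick check against the sample line from the question:
$ echo 'David|3456|ACCOUNT|MALFUNCTION|CANON|456' | sed -r 's/(.*)\|.*\|.*/\1/'
David|3456|ACCOUNT|MALFUNCTION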

A pure Bash solution:
while IFS= read -r line || [[ -n $line ]] ; do
    printf '%s\n' "${line%|*|*}"
done < inputfile
See Reading input files by line using read command in shell scripting skips last line (particularly the answer by Jahid) for details of how the while loop works.
See pattern matching in Bash for information about ${line%|*|*}.
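For a quick feel for that expansion, you can try it on the sample line directly:
$ line='David|3456|ACCOUNT|MALFUNCTION|CANON|456'
$ printf '%s\n' "${line%|*|*}"
David|3456|ACCOUNT|MALFUNCTION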

Related

How to remove consecutive repeating characters from every line?

I have the following lines in a file:
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;;;;
Acanthocephala;;;;;;;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;;Polymorphus;;
and I want to remove the repeating semicolon characters from all lines to look like below (note: there are repeating semicolons in the middle of some of the above lines too)
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Profilicollis;Profilicollis_altmani;
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;
Acanthocephala;
Acanthocephala;Palaeacanthocephala;Polymorphida;Polymorphidae;Polymorphus;
I would appreciate it if someone could share a bash one-liner to accomplish this.
You can use tr with "squeeze":
tr -s ';' < infile
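For example, with the second sample line:
$ printf '%s\n' 'Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;;;;' | tr -s ';'
Acanthocephala;Eoacanthocephala;Neoechinorhynchida;Neoechinorhynchidae;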
perl -p -e 's/;+/;/g' myfile # writes output to stdout
or
perl -p -i -e 's/;+/;/g' myfile # does an in-place edit
If you want to edit the file itself:
printf "%s\n" 'g/;;/s/;\{2,\}/;/g' w | ed -s foo.txt
If you want to pipe a modified copy of the file to something else and leave the original unchanged:
sed 's/;\{2,\}/;/g' foo.txt | whatever
These replace runs of 2 or more semicolons with single ones.
This could be solved easily by substitution, but I'll add an awk solution that plays with the FS/OFS variables:
awk -F';+' -v OFS=';' '$1=$1' file
or
awk -F';+' -v OFS=';' '($1=$1)||1' file
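The $1=$1 assignment forces awk to rebuild the record with OFS; without it the original separators would be printed unchanged. The ||1 in the second variant guards against lines whose first field evaluates as false (empty or 0), which the first variant would drop. For example, on the third sample line:
$ printf '%s\n' 'Acanthocephala;;;;;;;' | awk -F';+' -v OFS=';' '($1=$1)||1'
Acanthocephala;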
Here's a sed version of alaniwi's answer:
sed 's/;\+/;/g' myfile # Write output to stdout
or
sed -i 's/;\+/;/g' myfile # Edit the file in-place

Extract first word in colon separated text file

How do I iterate through a file and print only the first word of each line? The lines are colon-separated. Example:
root:01:02:toor
The file contains several lines. This is what I've done so far, but it doesn't work:
FILE=$1
k=1
while read line; do
    echo $1 | awk -F ':'
    ((k++))
done < $FILE
I'm not good with bash scripting at all, so this is probably very trivial for one of you.
Edit: the variable k is to count the lines.
Use cut:
cut -d: -f1 filename
-d specifies the delimiter
-f specifies the field(s) to keep
If you need to count the lines, just
count=$( wc -l < filename )
-l tells wc to count lines
awk -F: '{print $1}' FILENAME
That will print the first word when separated by colon. Is this what you are looking for?
To use a loop, you can do something like this:
$ cat test.txt
root:hello:1
user:bye:2
test.sh
#!/bin/bash
while IFS= read -r line || [[ -n $line ]]; do
    echo "$line" | awk -F: '{print $1}'
done < test.txt
Example of reading line by line in bash: Read a file line by line assigning the value to a variable
Result:
$ ./test.sh
root
user
A solution using perl
%> perl -F: -ane 'print "$F[0]\n";' [file(s)]
Change the "\n" to " " if you don't want a newline printed.
You can get the first word without any external commands in bash like so:
printf '%s' "${line%%:*}"
This expands the variable named line and deletes the longest suffix matching the glob :* (that's the %%; a single % would delete the shortest suffix), leaving only the first field.
With this solution you do need to write the loop yourself, though. If this is the only thing you want to do with each line, the cut solution is better, since it saves you the file iteration.
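If you do want the loop, here is a minimal sketch, assuming your input file is named filename as in the cut answer above:
while IFS= read -r line || [[ -n $line ]]; do
    printf '%s\n' "${line%%:*}"
done < filename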

sed - unterminated `s' command

I have this piece of code:
cat BP.csv | while read line ; do
    goterm=$(awk '{print $1}') ;
    name=$(awk '{print $2}') ;
    grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
file BP.csv has this format:
GO:0008283 cell proliferation
GO:0009405 pathogenesis
GO:0010201 response to continuous far red light stimulus by the high-irradiance response system
GO:0009641 shade avoidance
while GOEA.csv has this format:
4577 GO:0006807 0.994 2014_06_01
4577 GO:0016788 0.989 2014_06_01
4577 GO:0043169 0.977 2014_06_01
4577 GO:0043170 0.963 2014_06_01
sed doesn't work. I want to change GO:0043170, for example, to the string "pi", but it gives:
sed: -e expression #1, char 12: unterminated `s' command
Why?
Thanks.
You're running your awk commands against no input. Try this:
cat BP.csv | while read line ; do
    goterm=$(awk '{print $1}' <<< "$line") ;
    name=$(awk '{print $2}' <<< "$line") ;
    grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
Let's clean up this code a bit:
while read goterm name
do
    grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g"
done < BP.csv
The problem is that your awk statements are attempting to read in from STDIN just like your while is doing. You're reading from the same input stream.
What you want to do is to pull out the values from your line. I'm using read to do this. The read statement uses the values in $IFS to separate out the input. This is normally spaces, tabs, and newlines. The read reads each variable you put on the line, and the last value read in contains the entire rest of the line.
Thus:
while read line
reads in the entire line while:
while read goterm name
will break the line as
goterm="GO:0008283"
name="cell proliferation"
One more thing. When you use grep and sed together, you probably can get away with just sed:
while read goterm name
do
    sed -n "/$goterm/s/$goterm/pi/gp" GOEA.csv
done < BP.csv
The format of the sed command used here is:
/pattern/s/search/replacement/flags
So, I'm selecting lines with $goterm in them, then replacing $goterm with pi. The -n means don't print lines as sed processes them, and the p flag prints the lines where the substitution took place.
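For example, using two sample lines from GOEA.csv:
$ goterm='GO:0006807'
$ printf '%s\n' '4577 GO:0006807 0.994 2014_06_01' '4577 GO:0016788 0.989 2014_06_01' | sed -n "/$goterm/s/$goterm/pi/gp"
4577 pi 0.994 2014_06_01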
By the way, csv as a file suffix means comma-separated values, but neither file looks comma-separated. Are the fields separated by tabs? If so, you'll need to modify $IFS to be a tab.
I would restructure that whole thing more like this:
while read goterm restofline
do
    grep -w "${goterm}" GOEA.csv | sed -e "s/${goterm}/pi/g"
done < BP.csv
No reason for the awk things, as the bash read builtin will do rudimentary field splitting for you if you give it multiple variables. Also, you aren't using name anyway, so it's not needed. cat is unnecessary as well.
Depending on your exact use case, even the grep may be unnecessary, making the inner command simply sed -ne "s/${goterm}/pi/gp" GOEA.csv. Unless your purpose for the grep -w is eliminating lines where ${goterm} is a substring of a word instead of the whole word...
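To illustrate what -w buys you, take a hypothetical shorter term GO:004317: without -w it matches GO:0043170 as a substring, with -w it doesn't:
$ echo '4577 GO:0043170 0.963 2014_06_01' | grep 'GO:004317'
4577 GO:0043170 0.963 2014_06_01
$ echo '4577 GO:0043170 0.963 2014_06_01' | grep -w 'GO:004317'
(no output)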
For future reference, inserting a set -x above your loop in your script would show you the exact commands that are being run, so that you can compare them with your expectations.

How to read specific lines in a file in BASH?

while read -r line will run through each line in a file. How can I have it run through specific lines in a file, for example, lines "1-20", then "30-100"?
One option would be to use sed to get the desired lines:
while read -r line; do
    echo "$line"
done < <(sed -n '1,20p; 30,100p' inputfile)
This feeds lines 1-20 and 30-100 of the input file to read.
@devnull's sed command does the job. Another alternative is using awk, since it avoids the read and you can do the processing in awk itself:
awk '(NR>=1 && NR<=20) || (NR>=30 && NR<=100) {print "processing", $0}' file
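A quick sanity check of the range logic, using seq as input and letting awk's default action print the matching lines: ranges 1-20 and 30-100 make 20 + 71 = 91 lines in total.
$ seq 120 | awk '(NR>=1 && NR<=20) || (NR>=30 && NR<=100)' | wc -l
91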

Counting commas in a line in bash

Sometimes I receive a CSV file which has a carriage return inside a cell. This is not an acceptable format to a program that will use it as input.
In order to detect if an input line is split, I determined that a bad line would not have the expected number of commas in it. Is there a bash or other common unix command line tool that would allow me to count the commas in the line? If necessary, I can write a Python or Perl program to do it, but if possible, I'd like to add a line or two to an existing bash script to cause it to fail if the comma count is wrong. Any ideas?
Strip everything but the commas, and then count the number of characters left:
$ echo foo,bar,baz | tr -cd , | wc -c
2
To count the number of times a comma appears, you can use something like awk:
string='a line of input from the CSV file'
echo "$string" | awk -F "," '{print NF-1}'
But this really isn't sufficient to determine whether a field has carriage returns in it. Fields can have commas inside as long as they're surrounded by quotes.
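For example, this row has two fields and one real separator, but the naive count reports two:
$ echo '"Smith, John",42' | awk -F "," '{print NF-1}'
2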
What worked for me better than the other solutions was this. If test.txt has:
foo,bar,baz
baz,foo,foobar,bar
Then cat test.txt | xargs -I % sh -c 'echo % | tr -cd , | wc -c' produces
2
3
This works very well for streaming sources, or tailing logs, etc.
In pure Bash:
while IFS=, read -ra array
do
    echo "$((${#array[@]} - 1))"
done < inputfile
or
while read -r line
do
    count=${line//[^,]}
    echo "${#count}"
done < inputfile
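Step by step: ${line//[^,]} deletes every character that is not a comma, and ${#count} then measures what is left:
$ line='foo,bar,baz'
$ count=${line//[^,]}
$ echo "${#count}"
2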
Try Perl:
$ perl -ne 'print 0+@{[/,/g]},"\n"'
a
0
a,a
1
a,a,a,a,a
4
Depending on what you are trying to do with the CSV data, it may be helpful to use a wrapper script like csvquote to temporarily replace the problematic newlines (and commas) inside quoted fields, then restore them. For instance:
csvquote inputfile.csv | wc -l
and
csvquote inputfile.csv | cut -d, -f1 | csvquote -u
may be the sort of thing you're looking for. See https://github.com/dbro/csvquote for the code and more information.
An example Python command you could run (since Python is installed on most modern systems) is:
python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
This counts the number of commas per line, then makes a set from them (so if your lines all have the same number of commas in, you'll get a set with just that number in).
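For instance, a consistent two-comma file yields a single-element set, while an inconsistent file betrays itself with several counts:
$ printf 'a,b,c\nd,e,f\n' > my_file.csv
$ python -c "import pathlib; print({l.count(',') for l in pathlib.Path('my_file.csv').read_text().splitlines()})"
{2}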
Just remove all of the carriage returns:
tr -d "\r" < old_file > new_file
