Split sentences into separate lines - bash

I'm trying to split the sentences in a file into separate lines using a shell script.
The file I want to read from is my_text.txt and it contains:
you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.
I would like to split the string on "!", "?" or ".". The output should look like this:
you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script
I used this script:
while read p
do
echo $p | tr "? ! ." "\n "
done < my_text.txt
But the output is:
you want to learn shell script
First, you want to learn Linux command then you can learn shell script
Can somebody help?

This can be done with a single awk command using its global-substitution function, written and tested with the shown samples in GNU awk: simply substitute every ?, ! or . with ORS (the output record separator, a newline by default).
awk '{gsub(/\?|!|\./,ORS)} 1' Input_file
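If you also want to absorb the space that follows each terminator (the desired output has no leading spaces), a small variant of the same idea works; a sketch, assuming the my_text.txt from the question:
awk '{gsub(/[?!.] */,ORS)} 1' my_text.txt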

$ sed 's/[!?.]/\n/g' file
you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script
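One portability note: \n in the replacement of s/// is a GNU sed extension; with BSD/macOS sed you can embed a literal newline instead, escaped with a backslash. A sketch of the same command:
sed 's/[!?.]/\
/g' file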

You can chain three tr commands to split on ?, ! and .:
cat test_string.txt | tr "!" "\n" | tr "?" "\n" | tr "." "\n"

Awk is ideal for this:
awk -F '[?.!]' '{ for (i=1;i<=NF;i++) { print $i } }' file
Set the field delimiters to ?, . or !, then loop through each field and print the entry.
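One caveat: since the sample line ends with a terminator, the last field is empty, and every field after the first keeps a leading space. A sketch of a variant that handles both (POSIX says a multi-character FS is treated as an ERE, so the ' *' in the separator is portable):
awk -F'[?.!] *' '{ for (i=1;i<=NF;i++) if ($i != "") print $i }' file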

That's not how you use tr. Both of its arguments should be the same length; otherwise the second is extended to the length of the first by repeating its last character* (in this case, a space) to make a one-to-one transliteration possible. In other words, given "? ! ." and "\n " as arguments, tr replaces ? with a line feed, and !, ., and each space with a space. What you're looking for is, I guess:
$ tr '?!.' '\n' <file
you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script
Or, more portably:
tr '?!.' '[\n*]' <file
*This is what GNU tr does; POSIX leaves the behavior unspecified when the arguments aren't the same length.
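A quick way to observe that padding behavior with GNU tr (the second set is shorter, so its last character Y is repeated to cover . as well):
$ echo 'a?b!c.d' | tr '?!.' 'XY'
aXbYcYd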

In GNU awk we can do it with the gensub() function:
awk '{print gensub(/([.?!]\s*)/, "\n", "g", $0)}' file
you want to learn shell script
First, you want to learn Linux command
then
you can learn shell script
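Note that \s is a GNU regex operator; the POSIX character class works too, and since gensub()'s target argument defaults to $0 it can be dropped. A sketch:
awk '{print gensub(/[.?!][[:space:]]*/, "\n", "g")}' file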

Why limit yourself to a newline \n as the RS? Maybe something like this:
\056 is the period and \040 is the space. I add the + in case of the legacy practice of typing two spaces after each sentence, so that gets standardized too. I presume the question mark \077 is more frequent than the exclamation \041. The only reason I'm using octal throughout is that all of these characters can wreak havoc on a terminal at the slightest slip in quoting or escaping.
Unlike FS or RS, OFS/ORS are constant strings (are they?), so typing the characters in literally is safe.
The periods are taken care of by RS and need no special processing. So if the row contains neither ? nor !, just print it as is and move on (that handles the ". \n" case).
mawk 'BEGIN { RS = "[\056][\040]+"; ORS = ". \n";    # records split on ". "
              FS = "[\077][\040]+"; OFS = "? \n"; }  # fields split on "? "
      ($0 !~ /[\041\077]/) { print; next }           # neither ! nor ? : print as is
      /[\041]/ { gsub("[\041][\040]+", "\041 \n") }  # rewrite "! " as "! " + newline
      (NF == 1) || ($1 = $1)' my_text.txt
As fast as mawk is, a gsub() or a $1=$1 still costs money, so skip the costly parts unless the row actually contains a ? or ! mark.
The last line is the fun trick, done outside the braces. The ! was already handled on the line before, so if no ? was found (i.e. NF is 1), the first half evaluates true, awk short-circuits without executing part 2, and simply prints.
But if any ? marks were found, the assignment $1=$1 rebuilds the record with the new OFS, and because it's an assignment rather than an equality comparison, the value of the expression is the new (non-empty) $1, which acts as its own always-true flag to print at the end.

Awk's record separator variable RS should do the trick.
echo 'you want to learn shell script? First, you want to learn Linux command! then. you can learn shell script.' |
awk 'BEGIN{RS="[?.!] "}1'
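Two caveats: a regex (multi-character) RS is a gawk/mawk extension, and a traditional/POSIX awk would use only the first character of RS; also, because the final terminator is followed by a newline rather than a space, it survives in the last record. A sketch that strips it too, assuming the my_text.txt from the question:
awk 'BEGIN{RS="[?.!] "} {sub(/[?.!][[:space:]]*$/,"")} 1' my_text.txt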


How do I trim whitespace, but not newlines, from subshell output in bash?

There are many tens, maybe a hundred or more, previous questions here that seem "identical" to this, but after extensive searching I found NOTHING that even came close to working - though I did learn quite a lot - and so I decided to just RTFM and figure this out on my own.
The Problem
I wanted to search the output of a ps auxwww command to find processes of interest, and the issue was that I couldn't simply use cut to extract the exact data I wanted from them. ps, it turns out, columnates its output, adding extra spaces or tabs that get in the way of using cut to get the correct data.
So, since I'm not a master at bash, I did a search... The answers I found all focused either on variables - a "backup strategy" from my point of view that didn't itself solve the whole problem - or they trimmed only leading or trailing space, or all "whitespace" including newlines. NOPE, won't work for cut! And neither will removing trailing newlines and so forth.
So, restated, the question is: how do we efficiently reduce the whitespace between other characters to a single space, without eliminating newlines?
Below, I will give my answer, but I welcome others to give theirs - who knows, maybe someone has a better answer?!
Answer:
At least MY answer - please leave your own, too! - was to do this:
ps auxwww | grep <program> | tr -s '[:blank:]' | cut -d ' ' -f <field_of_interest>
This worked great!
Obviously, there are many ways to adapt this to other needs.
As an alternative to all of the pipes and grep with cut, you could simply use awk. The benefit of using awk with the default field separator (FS), which breaks on whitespace, is that it treats any run of whitespace between fields as a single separator.
So using awk does away with the need for tr -s to "squeeze" whitespace to define fields. Further, awk gives far greater control over field matching, using regular expressions rather than relying on grep over the full line and cut to locate a pre-determined field number. (Though to some extent you still have to tell awk which field of the ps output you are interested in.)
Using bash, you can also eliminate the pipe by using process substitution to send the output of ps auxwww to awk on stdin via redirection, e.g. awk ... < <(ps auxwww), for a single tidy command line.
To get your "program" and "field_of_interest" into awk you have two options. You can initialize awk variables using the -v var=value option (there can be multiple -v options), or you can use the BEGIN rule to initialize the variables. The only differences are that with -v you can provide a shell variable as the value and no whitespace is allowed around the = sign, while within BEGIN any whitespace is ignored.
So, in your case, here are a couple of examples that get the virtual memory size of firefox processes. You could use:
awk -v prog="firefox" -v fnum="5" '
$11 ~ prog {print $fnum}
' < <(ps auxwww)
(above if you had myprog=firefox as a shell variable, you could use -v prog="$myprog" to initialize the prog variable for awk)
or using the BEGIN rule, you could do:
awk 'BEGIN {prog = "firefox"; fnum = "5"}
$11 ~ prog {print $fnum }
' < <(ps auxwww)
Each command above locates the COMMAND field from ps (field 11), checks whether it contains firefox, and if so outputs field no. 5, the virtual memory size used by each process.
Both work fine as one-liners as well, e.g.
awk -v prog="firefox" -v fnum="5" '$11 ~ prog {print $fnum}' < <(ps auxwww)
Don't get me wrong, the pipeline is perfectly fine; it will just be slow. For short commands with limited output there won't be much difference, but when the output is large, awk will provide orders-of-magnitude improvement over having tr, grep and cut read the same records three times.
The reason is that the pipes and the process on each side require separate processes to be spawned by the shell, so minimizing their use improves the efficiency of whatever your script is doing. If the data and the processes are small, there isn't much of a difference. However, if you are reading a 3G file three times over, that is a difference of orders of magnitude: hours versus minutes or seconds.
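If you want to measure the difference on your own machine, bash's time keyword makes a quick (if unscientific) comparison easy; a sketch, reusing the firefox example (numbers will vary):
time ( ps auxwww | grep firefox | tr -s '[:blank:]' | cut -d ' ' -f 5 >/dev/null )
time ( awk '$11 ~ /firefox/ {print $5}' < <(ps auxwww) >/dev/null )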
I had to use single quotes on CentOS Linux to get tr working as described above:
ps -o ppid= $$ | tr -d '[:space:]'
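The quotes matter because an unquoted [:space:] is also a shell glob (a bracket expression matching one of the characters : s p a c e); it usually passes through only because no file happens to match it. A sketch of how it can go wrong:
touch s                          # a one-letter file named 's' in the current directory
echo 'a b' | tr -d [:space:]     # the glob expands to 's': tr deletes 's' chars, spaces survive
echo 'a b' | tr -d '[:space:]'   # quoted: tr deletes all whitespace, printing 'ab'
rm s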
You can reduce the number of pipes using this Perl one-liner, which uses Perl regexes instead of a separate grep process. It combines grep, tr and cut into a single command, with an easy way to manipulate the output (@F is the array of fields, 0-indexed):
Examples:
# Start an example process to provide the input for `ps` in the next commands:
/Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo &
# Print single space-delimited output of `ps` for all emacs processes:
ps auxwww | perl -lane 'print "#F" if $F[10] =~ /emacs/i'
# Prints:
# bar 72144 0.0 0.5 4610272 82320 s006 SN 11:15AM 0:01.31 /Applications/Emacs.app/Contents/MacOS/Emacs-x86_64-10_14 --geometry 109x65 /tmp/foo
# Print emacs PID and file name opened with emacs:
ps auxwww | perl -lane 'print join "\t", @F[1, -1] if $F[10] =~ /emacs/i'
# Prints:
# 72144 /tmp/foo
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlre: Perl regular expressions (regexes)

Parameter expansion not working when used inside Awk on one of the column entries

System: Linux. Bash 4.
I have the following file, which will be read into a script as a variable:
/path/sample_A.bam A 1
/path/sample_B.bam B 1
/path/sample_C1.bam C 1
/path/sample_C2.bam C 2
I want to append "_string" at the end of the filename of the first column, but before the extension (.bam). It's a bit trickier because of containing the path at the beginning of the name.
Desired output:
/path/sample_A_string.bam A 1
/path/sample_B_string.bam B 1
/path/sample_C1_string.bam C 1
/path/sample_C2_string.bam C 2
My attempt:
I wrote the following script (run as: bash script.sh):
List=${1};
awk -F'\t' -vOFS='\t' '{ $1 = "${1%.bam}" "_string.bam" }1' < ${List} ;
And its output was:
${1%.bam}_string.bam
${1%.bam}_string.bam
${1%.bam}_string.bam
${1%.bam}_string.bam
Problem:
I followed the idea of using awk for this substitution as in this thread: https://unix.stackexchange.com/questions/148114/how-to-add-words-to-an-existing-column , but the parameter expansion ${1%.bam} is clearly not being recognised by awk as I intend. Does someone know the correct syntax for that part of the code? It was meant to mean "the entry in the first column, minus the trailing .bam". I used ${1%.bam} because it works in Bash, but awk is another language, so this probably differs. Thank you!
Note that the parameter expansion you applied to $1 won't happen inside awk, because the entire body of the awk command is passed in '..', which sends the content literally without any shell parsing. Hence the string "${1%.bam}" is assigned as-is to the first column.
You can do this completely in Awk
awk -F'\t' 'BEGIN { OFS = FS }{ n=split($1, arr, "."); $1 = arr[1]"_string."arr[2] }1' file
The code splits the content of $1 on the delimiter . into an array arr inside awk. The part of the string up to the first . is stored in arr[1] and the subsequent split fields in the following indices. We then reconstruct the filename by concatenating the array entries, with _string appended to the extension-less part.
If I understood your requirement correctly, could you please try the following:
val="_string"
awk -v value="$val" '{sub(".bam",value"&")} 1' Input_file
Brief explanation: -v value="$val" passes the value of the shell variable val to the awk variable named value. Then the sub function of awk substitutes the string .bam with value followed by the matched text (.bam), which is what & denotes. The trailing 1 means print the line, edited or not.
Why OP's attempt didn't work: in awk we can't use shell variables directly without passing them in explicitly. So what you tried is not treated as an awk variable; it is taken as a literal string and printed as-is. The explanation above shows how to pass shell variables into awk.
NOTE: In case you have multiple occurrences of .bam, change sub to gsub in the above code. Also, if your Input_file is TAB-delimited, add -F'\t' to the above code.
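One more caveat: in the dynamic regexp ".bam" the dot is a metacharacter, so it would also match e.g. "Abam", and an unanchored match could hit .bam elsewhere on the line. A sketch that escapes the dot and anchors the match to the end of the first field, assuming the same tab-delimited input:
awk -F'\t' -v OFS='\t' -v value="_string" '{sub(/\.bam$/, value "&", $1)} 1' Input_file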
sed -i 's/\.bam/_string\.bam/g' myfile.txt
It's a one-liner with sed: just replace .bam with _string.bam.
You can try it this way with awk:
awk -v a='_string' 'BEGIN{FS=OFS="."}{$1=$1 a}1' infile

fast alternative to grep file multiple times?

I currently use long piped bash commands to extract data from text files like this, where $f is my file:
result=$(grep "entry t $t " $f | cut -d ' ' -f 5,19 | \
sort -nk2 | tail -n 1 | cut -d ' ' -f 1)
I use a script that might do hundreds of similar searches of $f, sorting the selected lines in various ways depending on what I'm pulling out. I like one-line bash strings with a bunch of pipes because it's compact and easy, but it can take forever. Can anyone suggest a faster alternative? Maybe something that loads the whole file into memory first?
Thanks
You might get a boost by doing the whole pipeline in gawk, or another awk that has asorti:
contents="$(cat "$f")"
result="$(awk -vpattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")"
This will read "$f" into a variable then we'll use a single awk command (well, gawk anyway) to do all the rest of the work. Here's how that works:
-vpattern="entry t $t": defines an awk variable named pattern that contains the shell variable t
$0 ~ pattern matches the current line against the pattern; if it matches we do the part in the braces, otherwise we skip it
matches[$5]=$19 adds an entry to an array (and creates the array if needed) where the key is the 5th field and the value is the 19th
END do the following after all the input has been processed
asorti(matches,inds) sort the keys of matches, so that inds becomes an array holding those keys in sorted order
print inds[1] prints the first key in that sorted order (i.e., the smallest $5 that matched; note that asorti sorts by key, not by the 19th-field value)
<<<"$contents" have awk work on the value in the shell variable contents as though it were a file it was reading
Then you can just update the pattern for each search, without reading the file from disk each time or needing so many extra processes for all the pipes.
You'll have to benchmark to see whether it's really faster, though; and if performance is important, you should really think about moving to a "proper" language instead of shell scripting.
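To make the read-once idea concrete, here's a sketch of how the surrounding loop could look (the values of t are hypothetical):
contents=$(<"$f")                    # bash: slurp the file into memory once
for t in 1 2 3; do                   # hypothetical values of t
    result=$(awk -v pattern="entry t $t" '$0 ~ pattern {matches[$5]=$19} END {asorti(matches,inds); print inds[1]}' <<<"$contents")
    echo "t=$t -> $result"
done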
Since you haven't provided sample input/output, this is just a guess, and I only post it because there are other answers already posted that suggest things you should not do - so this may be what you want instead of that one-liner:
result=$(awk -v t="$t" '
BEGIN { regexp = "entry t " t " " }
$0 ~ regexp {
if ( ($6 > maxKey) || (maxKey == "") ) {
maxKey = $6
maxVal = $5
}
}
END { print maxVal }
' "$f")
I suspect your real performance issue, however, isn't that script but that you are running it, and maybe others, inside a loop that you haven't shown us. If so, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice and post a better example so we can help you.

Substituting the '*' Character in AWK using 'gsub'

I'm trying to use awk in a Unix shell script to substitute each instance of one pattern in a file with another and output the result to a new file.
Specifically, if the file name is MYFILE.pc, then I'm looking to replace instances of '*MYFILE' with 'g_MYFILE' (without the quotes). For this, I'm using the gsub function in awk.
I've successfully got the output file written out with all instances replaced as required; however, the script is also replacing instances of 'MYFILE' (i.e. without the star) with 'g_MYFILE'.
Here is the script:
awk -v MODNAM=${OUTPUT_FILE%.pc} '
{
gsub("\*"MODNAM, "g_" MODNAM);
print
}' ${INPUT_FILE} > ${FULL_OUTPUT_FILENAME}
To clarify, the script performs the following conversions:
'*MYFILE' --> 'g_MYFILE'
'MYFILE' --> 'g_MYFILE'
I only want the first conversion to be performed. Does anyone have any suggestions?
You may need to double-escape the * because you are using a dynamic regexp rather than a regexp constant as the first argument to gsub. See section 3.8 of the GAWK manual for more information.
awk -v MODNAM=${OUTPUT_FILE%.pc} '
{
gsub("\\*"MODNAM, "g_" MODNAM);
print
}' ${INPUT_FILE} > ${FULL_OUTPUT_FILENAME}
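To see the two levels of parsing at work, compare a dynamic regexp with a regexp constant; a sketch:
echo '*foo foo' | awk '{gsub("\\*foo","g_foo")} 1'   # string "\\*foo" -> regexp \*foo -> literal *foo; prints: g_foo foo
echo '*foo foo' | awk '{gsub(/\*foo/,"g_foo")} 1'    # regexp constant, parsed once; prints: g_foo foo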
Your code actually works in my mawk 1.3.3 under zsh. This might be a shell escaping issue: have you tried writing the awk script to a file and calling it via -f?
For simple substitutions there is no need for awk at all. Try:
/home/sirch> file="*a*a*a"
/home/sirch> echo ${file//\*/g_}
g_ag_ag_a

IP address and Country on same line with AWK

I'm looking for a one-liner that, given a list of IPs, appends the country where each IP is based.
So if I have this as input:
87.229.123.33
98.12.33.46
192.34.55.123
I'd like to produce this:
87.229.123.33 - GB
98.12.33.46 - DE
192.34.55.123 - US
I've already got a script that returns the country for an IP, but I need to glue it all together with awk; so far this is what I came up with:
$ get_ips | nawk '{ print $1; system("ip2country " $1) }'
This is all cool, but the IP and the country are not displayed on the same line. How can I merge the system output and the IP onto one line?
If you have a better way of doing this, I'm open to suggestions.
You can use printf instead of print:
{ printf("%s - ", $1); system("ip2country " $1); }
The proper one-liner solution in awk is:
awk '{printf("%s - ", $1) ; system("ip2country \"" $1 "\"")}' < inputfile
However, I think it would be much faster to use a Python program like this:
#!/usr/bin/python
# 'apt-get install python-geoip' if needed
import GeoIP

gi = GeoIP.new(GeoIP.GEOIP_MEMORY_CACHE)
for line in file("ips.txt", "r"):
    line = line[:-1]  # strip the trailing newline from the line
    print line, "-", gi.country_code_by_addr(line)
As you can see, the GeoIP object is initialized only once and then reused for all queries. See the Python binding for GeoIP. Also be aware that your awk solution forks two new processes per line!
I don't know how many entries you need to process, but if there are many, you should consider something that doesn't fork and that keeps the GeoIP database in memory.
I'll answer with a Perl one-liner because I know that syntax better than awk's. The chomp cuts off the newline that is bothering you.
get_ips | perl -ne 'chomp; print; print `ip2country $_`'
