Bash: Variable1 > get first n words > cut > Variable2

I've read so many entries here and my head is exploding. I can't find the "right" solution; maybe my bad English is part of the reason, and my really low bash skills certainly are.
I'm writing a script which reads input from a user (me) into a variable.
read TEXT
echo $TEXT
Hello, this is a sentence with a few words.
What I want is (I'm sure) probably very simple: I now need the first n words in a second variable. Something like
$TEXT tr/csplit/grep/truncate/cut/awk/sed/whatever get the first 5 words > $TEXT2
echo $TEXT2
Hello, this is a sentence
I've tried ${TEXT:0:10}, for example, but that can also cut in the middle of a word. And I don't want to use text files for input/output, just variables. Is there any really simple, low-level solution for this, without losing myself in big, complex code blocks and hundreds of (/[{*+$'-:%"})]... and so on? :(
Thanks a lot for any support!

Using cut could be a simple solution, but the following works too, with xargs:
firstFiveWords=$(xargs -n 5 <<< "Hello, this is a sentence with a few words." | awk 'NR>1{exit};1')
$ echo $firstFiveWords
Hello, this is a sentence
From the man page of xargs:
-n max-args
Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.
and awk 'NR>1{exit};1' prints only the first line of its input.
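For completeness, two more minimal approaches (not from the answer above, just common idioms, assuming bash and the question's own variable names):
# cut on the space delimiter, as hinted at the start of this answer:
TEXT2=$(cut -d' ' -f1-5 <<< "$TEXT")
# or pure bash: split into an array, then rejoin the first five words:
read -ra words <<< "$TEXT"
TEXT2="${words[*]:0:5}"
echo "$TEXT2"
Hello, this is a sentence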

Related

sed can't replace substring with special characters

[Mac/Terminal] I'm trying to replace words in a sentence with red-colored versions of them. I'm trying to use sed, but it's not outputting the result in the format I'm expecting. i.e.
for w in ${sp}; do
msg=`echo $msg | sed "s/$w/\\033[1;31m$w\\033[0m/g"`
done
results in:
033[1;31mstb033[0m 033[1;31mshu033[0m 033[1;31mkok033[0m
where $sp is a list of a subset of words contained in $msg
the desired output would look like:
\033[1;31mstb\033[0m \033[1;31mshu\033[0m \033[1;31mkok\033[0m
and then my hope would be that echo -e would interpret this correctly and show the red coloring instead. So far, however, I seem to not understand quite correctly how sed works in order to accomplish this.
This seems hugely inefficient. Why do you not simply replace all the words in one go and put in the actual escape codes immediately?
sp='one two three'
msg='one little mouse, two little mice, three little mice'
echo "$msg" | sed -E "s/${sp// /|}/^[[1;31m&^[[0m/g"
Output (where I use bold to mark up the red color¹):
one little mouse, two little mice, three little mice
The sed -E option is just to allow us to use a simpler regex syntax (on Linux and some other platforms, try sed -r or simply translate the script to Perl).
You would type ctrl-V esc where you see ^[ in the command line above.
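If typing a literal escape is awkward, an alternative (my addition, not part of the original answer) is to splice in the escape character with bash's ANSI-C quoting, so the command can be copied and pasted as-is:
esc=$'\033'   # a real escape character, without any special typing
echo "$msg" | sed -E "s/${sp// /|}/${esc}[1;31m&${esc}[0m/g"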
If you need the message in a variable for repeated use, look at printf -v.
¹ Looks like Stack Overflow doesn't support <span style="color:red">, unfortunately.
What about using an array, and printf instead of echo?
$ sp="Now is the time..."
$ w=( $sp )
$ printf -v output '\e[1;31m%s\e[0m ' "${w[@]}"
$ echo "$output"
Now is the time...
The output is obviously red, which doesn't come across here, but:
$ printf '%q\n' "$output"
$'\E[1;31mNow\E[0m \E[1;31mis\E[0m \E[1;31mthe\E[0m \E[1;31mtime...\E[0m '
And if you don't like the trailing space, you can trim it with ${output% }.

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I want is the character at the position given by the value in the second column, plus the next 10 characters. I will give an example in a moment.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (file2.txt, just one very long line with no spaces):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument
There's no need for cut here; dd can do the job of indexing into a file and reading only the number of bytes you want. (Note that status=none is a GNUism; on other platforms you may need to leave it out and redirect stderr instead if you want to suppress the informational logging.)
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).
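For non-GNU platforms, a portable variant of the dd line (my addition, following the note above) would silence the logging by redirecting stderr instead:
dd if=file2.txt bs=1 skip="$index" count=10 2>/dev/null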
Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2.txt file1.txt
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read the data without a useless cat, and for the while loop.)
One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although a poor fit if you need high performance. Shell scripting is commonly done by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line of the loop body is pretty easy: it just extracts the number from the current line of file1.txt. The second line uses the very nice tools head and tail. Usually they are used with lines, but -c makes them count characters instead. So head prints the first pos + 10 characters, and its output is piped into tail, which prints the last 10 of them.
Thanks to @CharlesDuffy for improvements.

get highest number then print next number in new file

I have a pipe-delimited file info.txt. Can you give me an idea how to get the highest suffix and add entries after it based on a pattern?
info="$HOME/info.txt"
echo "Input the pattern: "
read pattern
awk '/pattern/{ print $0 }' $info >> $HOME/temp1.$$
sed 's/MICRO_AU_FILE//g' $HOME/temp1.$$
##then count the highest num, but i think this is not a good approach
##if i got the highest num, then print the next number
for ACC_NUM in `cat acc`
do
echo "$pattern-FILE$Highestsufix|server|$ACC_NUM*| >> $HOME/tempfile.$$
cat $HOME/tempfile.$$ >> $info
done
fi
info.txt
MICRO_AU-FILE01|serve|12345
MICRO_AU-FILE02|serve|23456
MICRO_AU-FILE04|serve|34534
MICRO_PH-FILE01|serve|56457
MICRO_PH-FILE02|serve|12345
MICRO_BN-FILE01|serve|78564
MICRO_BN-FILE03|serve|45267
acc
11111
22222
output: if my pattern is MICRO_AU
MICRO_AU-FILE01|serve|12345
MICRO_AU-FILE02|serve|23456
MICRO_AU-FILE04|serve|34534
MICRO_PH-FILE01|serve|56457
MICRO_PH-FILE02|serve|12345
MICRO_BN-FILE01|serve|78564
MICRO_BN-FILE03|serve|45267
MICRO_AU-FILE05|serve|11111
MICRO_AU-FILE06|serve|22222
I would extract the suffixes, sort them in descending numerical order, and take the highest one. If the input is as regular as in the example, this would be simply
HIGHEST_INDEX=$(grep "$pattern" "$info" | cut -c 14,15 | sort -nr | head -n 1)
If the structure of the lines can vary, you would have to adapt the column selector (cut -c 14,15) to your needs.
UPDATE: I just noticed that you have tagged your question with shell, not with bash, zsh, or ksh. If you need your program to also run on a plain Bourne shell, you have to use backticks instead of $( ):
HIGHEST_INDEX=`grep "$pattern" "$info" | cut -c 14,15 | sort -nr | head -n 1`
In general, with this type of question it is best to state explicitly which shell(s) your program should run on. The more specific you are in this respect, the better a solution we can suggest. For example, getting the next higher number (after HIGHEST_INDEX) is more complicated in the Bourne shell than in the others.
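To make the rest of the flow concrete, here is a sketch (my addition, assuming bash and the file layout from the question; the names follow the question's own):
info="$HOME/info.txt"
printf 'Input the pattern: '
read -r pattern
# highest existing suffix for this pattern, e.g. 04 for MICRO_AU
highest=$(grep "^$pattern" "$info" | sed 's/^.*-FILE//; s/|.*//' | sort -n | tail -n 1)
n=$((10#${highest:-0}))   # force base 10 so suffixes 08 and 09 don't parse as octal
while read -r acc; do
  n=$((n + 1))
  printf '%s-FILE%02d|serve|%s\n' "$pattern" "$n" "$acc" >> "$info"
done < acc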

WC on OSX - Return includes spaces

When I run the word count command in the OSX terminal, like wc -c file.txt, I get the answer below, with spaces padded before the number. Does anyone know why this happens, or how I can prevent it?
   18000 file.txt
I would expect to get:
18000 file.txt
This occurs using bash or the Bourne shell.
The POSIX standard for wc may be read to imply that there are no leading blanks, but does not say that explicitly. Standards are like that.
This is what it says:
By default, the standard output shall contain an entry for each input file of the form:
"%d %d %d %s\n", <newlines>, <words>, <bytes>, <file>
and does not mention the formats for the single-column options such as -c.
A quick check shows me that AIX, OSX, Solaris use a format which specifies the number of digits for the value — to align columns (and differ in the number of digits). HPUX and Linux do not.
So it is just an implementation detail.
I suppose it is a way of getting outputs to line up nicely, and as far as I know there is no option to wc which fine tunes the output format.
You could get rid of them pretty easily by piping through sed 's/^ *//', for example.
There may be an even simpler solution, depending on why you want to get rid of them.
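For instance, to make the sed suggestion concrete (my addition):
$ wc -c file.txt | sed 's/^ *//'
18000 file.txt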
At least under macOS/bash, wc exhibits the behavior of padding its output with leading whitespace.
It can be avoided using expr:
echo -n "some words" | expr $(wc -c)
>> 10
echo -n "some words" | expr $(wc -w)
>> 2
Note: The -n prevents echoing a newline character which would count as 1 in wc -c
This bugs me every time I write a script that counts lines or characters. I wish that wc were defined not to emit the extra spaces, but it's not, so we're stuck with them.
When I write a script, instead of
nlines=`wc -l $file`
I always say
nlines=`wc -l < $file`
so that wc's output doesn't include the filename, but that doesn't help with the extra spaces. The trick I use next is to add 0 to the number, like this:
nlines=`expr $nlines + 0` # get rid of trailing spaces
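In bash specifically, arithmetic expansion does the same normalization without spawning expr (my addition, for comparison):
nlines=$(wc -l < "$file")
nlines=$((nlines + 0))   # arithmetic expansion discards the padding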

BASH Palindrome Checker

This is my first time posting on here so bear with me please.
I received a bash assignment but my professor is completely unhelpful and so are his notes.
Our assignment is to filter and print out palindromes from a file. In this case, the file is:
/usr/share/dict/words
The word lengths range from 3 to 45, and we are supposed to keep only lowercase letters (the dictionary given contains special characters and uppercase as well as lowercase letters, e.g. "-dkas-das"), so something like "q-evvavve-q" may technically read as a palindrome, but I shouldn't be getting it as a proper result.
Anyway, I can get it to filter out words of a given length and return them (though without filtering to only lowercase):
grep "^...$" /usr/share/dict/words |
grep "\(.\).\1"
And I can use subsequent lines for 5 letter words and 7 and so on:
grep "^.....$" /usr/share/dict/words |
grep "\(.\)\(.\).\2\1"
But the prof does not want that. We are supposed to use a loop. I get the concept but I don't know the syntax, and like I said, the notes are very unhelpful.
What I tried was setting variables x=... and y=.. and in a while loop, having x=$x$y but that didn't work (syntax error) and neither did x+=..
Any help is appreciated. Even getting my non-lowercase letters filtered out.
Thanks!
EDIT:
If you're providing a solution or a hint to a solution, the simplest method is preferred.
Preferably one that uses 2 grep statements and a loop.
Thanks again.
Like this:
for word in `grep -E '^[a-z]{3,45}$' /usr/share/dict/words`;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Output using my dictionary:
aha
bib
bob
boob
...
wow
Update
As pointed out in the comments, reading in most of the dictionary into a variable in the for loop might not be the most efficient, and risks triggering errors in some shells. Here's an updated version:
grep -E '^[a-z]{3,45}$' /usr/share/dict/words | while read -r word;
do [ $word == `echo $word | rev` ] && echo $word;
done;
Why use grep? Bash will happily do that for you:
#!/bin/bash
is_pal() {
local w=$1
while (( ${#w} > 1 )); do
[[ ${w:0:1} = ${w: -1} ]] || return 1
w=${w:1:-1}
done
}
while read word; do
is_pal "$word" && echo "$word"
done
Save this as banana, chmod +x banana and enjoy:
./banana < /usr/share/dict/words
If you only want to keep the words with at least three characters:
grep ... /usr/share/dict/words | ./banana
If you only want to keep the words that only contain lowercase and have at least three letters:
grep '^[[:lower:]]\{3,\}$' /usr/share/dict/words | ./banana
The multiple greps are wasteful. You can simply do
grep -E '^([a-z])[a-z]\1$' /usr/share/dict/words
in one fell swoop, and similarly, put the expressions on grep's standard input like this:
echo '^([a-z])[a-z]\1$
^([a-z])([a-z])\2\1$
^([a-z])([a-z])[a-z]\2\1$' | grep -E -f - /usr/share/dict/words
However, regular grep does not permit backreferences beyond \9. With grep -P you can use double-digit backreferences, too.
The following script constructs the entire expression in a loop. Unfortunately, grep -P does not allow for the -f option, so we build a big thumpin' variable to hold the pattern. Then we can actually also simplify to a single pattern of the form ^(.)(?:.|(.)(?:.|(.)....\3)?\2)?\1$, except we use [a-z] instead of . to restrict to just lowercase.
head=''
tail=''
for i in $(seq 1 22); do
head="$head([a-z])(?:[a-z]|"
tail="\\$i${tail:+)?}$tail"
done
grep -P "^${head%|})?$tail$" /usr/share/dict/words
The single grep should be a lot faster than individually invoking grep 22 or 43 times on the large input file. If you want to sort by length, just add that as a filter at the end of the pipeline; it should still be way faster than multiple passes over the entire dictionary.
The expression ${tail:+)?} evaluates to a closing parenthesis and question mark only when tail is non-empty, which is a convenient way to force the \1 back-reference to be non-optional. Somewhat similarly, ${head%|} trims the final alternation operator from the ultimate value of $head.
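A quick demonstration of those two expansions in isolation (just to illustrate the constructs; not part of the script above):
tail='\2)?\1'
echo "${tail:+)?}"   # prints )? because tail is non-empty; with tail empty, prints nothing
head='([a-z])(?:[a-z]|'
echo "${head%|}"     # prints the value with the trailing | removed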
OK, here is something to get you started:
I suggest using the plan you have above; just generate the number of dots (".") with a for loop.
This question will explain how to make a for loop from 3 to 45:
How do I iterate over a range of numbers defined by variables in Bash?
for i in {3..45};
do
* put your code above here *
done
Now you just need to figure out how to produce i dots (".") for your first grep, and you are done. A sketch of one way to flesh this out follows below.
Also, look into sed; it can nuke the non-lowercase entries for you.
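Here is that sketch (my addition, not the answer's own code): it builds the per-length pattern in a loop and uses grep -P so that back-references past \9 work, as noted in the earlier answer:
#!/bin/bash
for len in $(seq 3 45); do
  half=$(( len / 2 ))
  pat='^'; back=''
  for g in $(seq 1 "$half"); do
    pat="$pat([a-z])"      # capture the g-th letter...
    back="\\$g$back"       # ...and require its mirror at the matching position
  done
  (( len % 2 )) && pat="$pat[a-z]"   # optional middle letter for odd lengths
  grep -P "$pat$back\$" /usr/share/dict/words
done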
Another solution that uses a Perl-compatible regular expression (PCRE) with recursion, heavily inspired by this answer:
grep -P '^(?:([a-z])(?=[a-z]*(\1(?(2)\2))$))++[a-z]?\2?$' /usr/share/dict/words
