How to print unique values in order of appearance? - bash

I'm trying to get the unique values from the list below, while keeping them in their original order, that is, the order of appearance.
group
swamp
group
hands
swamp
pipes
group
bellyful
pipes
swamp
emotion
swamp
pipes
bellyful
after
bellyful
I've tried combining the sort and uniq commands, but the output is sorted alphabetically, and if I don't use sort, uniq doesn't work.
$ sort file | uniq
after
bellyful
emotion
group
hands
pipes
swamp
and my desired output would be like this
group
swamp
hands
pipes
bellyful
emotion
after
How can I do this?

A short, jam-packed awk invocation will get the job done. We'll create an associative array and count every time we've seen a word:
$ awk '!count[$0]++' file
group
swamp
hands
pipes
bellyful
emotion
after
Explanation:
Awk processes the file one line at a time and $0 is the current line.
count is an associative array mapping lines to the number of times we've seen them. Awk doesn't mind us accessing uninitialized variables. It automatically makes count an array and sets the elements to 0 when we first access them.
We increment the count each time we see a particular line.
We want the overall expression to evaluate to true the first time we see a word, and false every successive time. When it's true, the line is printed. When it's false, the line is ignored. The first time we see a word count[$0] is 0, and we negate it to !0 == 1. If we see the word again count[$0] is positive, and negating that gives 0.
Why does true mean the line is printed? The general syntax we're using is expr { actions; }. When the expression is true the actions are taken. But the actions can be omitted; the default action if we don't write one is { print; }.
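If you'd rather avoid awk, the same order-preserving dedup can be sketched with coreutils alone (assuming GNU sort, for the stable -s flag): number the lines, keep only the first copy of each value, then restore the original order.

```shell
# Number each line, stable-sort by the text keeping only the first of
# each duplicate run (-s -u), re-sort by line number, then drop it.
printf 'group\nswamp\ngroup\nhands\nswamp\n' |
  nl -ba | sort -s -k2,2 -u | sort -n | cut -f2-
# → group, swamp, hands (one per line)
```

The awk one-liner is shorter and makes a single pass; this pipeline is mainly useful as a way to see the "first occurrence wins" logic spelled out step by step.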

Related

How to determine the number of grouped numbers in a string in bash

I have a string in bash
string="123 abc 456"
Where numbers that are grouped together are considered 1 number.
"123" and "456" would be considered numbers in this case.
How can I determine the number of grouped-together numbers?
so
"123"
is determined to be a string with just one number, and
"123 abc 456"
is determined to be a string with 2 numbers.
egrep -o '[0-9]+' <<<"$string" | wc -l
Explanation
egrep: This performs an extended regular expression match on the lines of a given file (or, in this case, a herestring). It usually returns lines of text within the string that contain at least one chunk of text that matches the supplied pattern. However, the -o flag tells it to return only those matching chunks, one per line of output.
'[0-9]+': This is the regular expression that the string is compared against. Here, we are telling it to match successive runs of 1 or more digits, and no other character.
<<< The herestring operator allows us to pass a string into a command as if it were the contents of a file.
| This pipes the output of the previous command (egrep) to become the input for the next command (wc).
wc: This performs a word count, normally returning the number of words in a given argument. However, the -l tells it to do a line count instead.
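Putting the pieces together on the sample string (using grep -E, the modern spelling of egrep, and a plain pipe in place of the bash-only herestring):

```shell
string="123 abc 456"
printf '%s\n' "$string" | grep -Eo '[0-9]+' | wc -l
# prints 2: the groups "123" and "456"
```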
UPDATE: 2018-08-23
Is there any way to adapt your solution to work with floats?
The regular expression that matches both integer numbers and floating point decimal numbers would be something like this: '[0-9]*\.?[0-9]+'. Inserting this into the command above in place of its predecessor, forms this command chain:
egrep -o '[0-9]*\.?[0-9]+' <<<"$string" | wc -l
Focussing now only on the regular expression, here's how it works:
[0-9]: This matches any single digit from 0 to 9.
*: This is an operator that applies to the expression that comes directly before it, i.e. the [0-9] character class. It tells the search engine to match any number of occurrences of the digits 0 to 9 instead of just one, but no other character. Therefore, it will match "2", "26", "4839583", ... but it will not match "9.99" as a singular entity (but will, of course, match the "9" and the "99" that feature within it). As the * operator matches any number of successive digits, this can include zero occurrences (this will become relevant later).
\.: This matches a singular occurrence of a period (or decimal point), ".". The backslash is a special character that tells the search engine to interpret the period as a literal period, because this character itself has special function in regular expression strings, acting as a wildcard to match any character except a line-break. Without the backslash, that's what it would do, which would potentially match "28s" if it came across it, where the "s" was caught by the wildcard period. However, the backslash removes the wildcard functionality, so it will now only match with an actual period.
?: Another operator, like the *, except this one tells the search engine to match the previous expression either zero or one times, but no more. In other words, it makes the decimal point optional.
[0-9]+: As before, this will match digits 0 to 9, the number of which here is determined by the + operator, which stands for at least one, i.e. one or more digits.
Applying this to the following string:
"The value of pi is approximately 3.14159. The value of e is about 2.71828. The Golden Ratio is approximately 1.61803, which can be expressed as (√5 + 1)/2."
yields the following matches (one per line):
3.14159
2.71828
1.61803
5
1
2
And when this is piped through the wc -l command, it returns a count of the lines, which is 6; i.e. the supplied string contains 6 number strings, including both integers and floating point decimals.
If you wanted only the floating point decimals, and to exclude the integers, the regular expression is this:
'[0-9]*\.[0-9]+'
If you look carefully, it's identical to the previous regular expression, except for the missing ? operator. If you recall, the ? made the decimal point an optional feature to match; removing this operator now means the decimal point must be present. Likewise, the + operator matches at least one instance of a digit following the decimal point. However, the * operator before it matches any number of digits, including zero digits. Therefore, "0.61803" would be a valid match (if it were present in the string, which it isn't), and ".33333" would also be a valid match, since the digits before the decimal point needn't be there thanks to the * operator. However, whilst "1.1111" would be a valid match, "1111." would not be, because the + operator dictates that there must be at least one digit following the decimal point.
Putting it into the command chain:
egrep -o '[0-9]*\.[0-9]+' <<<"$string" | wc -l
returns a value of 3, for the three floating point decimals occurring in the string, which, if you remove the | wc -l portion of the command, you will see in the terminal output as:
3.14159
2.71828
1.61803
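As a quick check of the two patterns side by side, on a simplified version of the pi/e sentence:

```shell
s='pi is 3.14159, e is 2.71828, golden ratio 1.61803, or (sqrt(5) + 1)/2'
printf '%s\n' "$s" | grep -Eo '[0-9]*\.?[0-9]+' | wc -l   # 6: all numbers
printf '%s\n' "$s" | grep -Eo '[0-9]*\.[0-9]+'  | wc -l   # 3: floats only
```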
For reasons I won't go into, matching integers exclusively while excluding floating point decimals is harder to accomplish without Perl-flavoured regular expressions (which egrep does not support). However, since you're really only interested in the number of these occurrences, rather than the matches themselves, we can create a regular expression that doesn't need to worry about accurately matching the integers, as long as it produces the same number of matched items. This expression:
'[^.0-9][0-9]+(\.([^0-9]|$)|[^.])'
seems to be good enough for counting the integers in the string, which includes the 5, 1 and 2 (ignoring, of course, the √ symbol), returning these approximate matched substrings:
√5
1)
/2.
I haven't tested it that thoroughly, however, and only formulated it tonight when I read your comment. But, hopefully, you are beginning to get a rough sense of what's going on.
In case you need to know the number of grouped digits in string then following may help you.
string="123 abc 456"
echo "$string" | awk '{print gsub(/[0-9]+/,"")}'
Explanation: the following annotated version is for explanation purposes only.
string="123 abc 456"     ##Create a variable named string with the value 123 abc 456.
echo "$string"           ##Print the value of string with echo...
|                        ##...and pipe its output into the awk command.
awk '{                   ##Start the awk program.
print gsub(/[0-9]+/,"")  ##gsub replaces every run of digits with "" (NULL) and returns
                         ##the number of substitutions made, which equals the number of
                         ##digit groups present; print then outputs that count.
}'                       ##Close the awk program.

Removing blankspace at the start of a line (size of blankspace is not constant)

I am a beginner to using sed. I am trying to use it to edit down a uniq -c result to remove the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises (I have tested it with a simple substitute) the first number but also drags in the prior blankspace for some reason.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trimming will not work as some lines contain double/triple digit numbers and so do not have the same amount of blankspace before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks. There's no other way than to strip it yourself.
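One way around it, sketched with awk, is to skip uniq -c entirely and do the counting yourself, printing a tab-separated count and line directly:

```shell
# Count duplicate lines and emit "count<TAB>line". The order of awk's
# for-in loop is unspecified, so pipe through sort if you need one.
printf 'a\nb\na\na\n' |
  awk '{ count[$0]++ } END { for (line in count) print count[line] "\t" line }'
```

Unlike uniq, this doesn't require the input to be sorted first, and the output is already a usable TSV.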

shell: What is meant by the number of sentences

I need to count the number of sentences and paragraphs in a text file, but I do not understand how to do this.
I can count the number of lines and words using the wc command, but I do not understand what a sentence or a paragraph means in a text file. Is there any shell command to do this?
Here's how we count number of words and lines in a text file:
wc -w filename
wc -l filename
For sentences and paragraphs, here is what I tried:
grep -c \\. #to count number of sentences.
grep -o [.'\n'] #to count number of paragraph.
I do not understand how to count number of sentences and paragraphs in a text file.
Any ideas will be helpful.
for example:
Main article: SSID#Security of SSID hiding.
A simple but ineffective method to attempt to secure a wireless network is SSID (Service Set Identifier).[12][13] This provides very little protection against anything but the most casual intrusion efforts...
2 paragraphs and 3 sentences.
A first approximation can be obtained under the assumptions that:
Sentences end with a period and periods are only used for that (no
decimal numbers, no ellipsis, etc.)
Paragraphs are separated with exactly one empty line
(Of course those are not met in reality but it should get you started)
grep -o '\.' | wc -l
will count the number of sentences (note that grep's -c counts matching lines rather than individual matches, even when combined with -o, so we count the matches with wc -l instead), and
grep -c '^$'
will count the number of empty lines, which is one less than the number of paragraphs. If your text is strongly formatted you may get to something that works; otherwise, you could consider using Natural Language Processing tools such as NLTK.
To count the number of sentences, you could count the number of periods, question marks, and exclamation points. But then you run into the problem of an ellipsis (...). I suppose you could count it only if it has whitespace afterwards.
Paragraphs are another matter. Are they indented? How, with a tab? Then count them.
The big question is 'What is the delimiter between sentences and paragraphs?'
When you know that, define the delimiter regex, and count how many are in the file using the tool of your choice.
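Under the same rough assumptions (sentences end with ., ! or ?; paragraphs are separated by single blank lines), both counts can be sketched in one awk pass:

```shell
printf 'One. Two! Three?\n\nNext paragraph here.\n' |
  awk '{ s += gsub(/[.!?]/, "") }   # count sentence terminators on each line
       /^$/ { p++ }                 # count blank separator lines
       END { print s " sentences, " p + 1 " paragraphs" }'
# prints: 4 sentences, 2 paragraphs
```

This inherits all the caveats above: decimal numbers, abbreviations, and ellipses will inflate the sentence count, and trailing blank lines will inflate the paragraph count.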

Bash: add multiple lines after a first occurene of matched pattern in the file

How to achieve this with awk/sed?
Input:
zero
one
two
three
four
output:
zero
one
one-one
one-two
one-three
two
three
four
Note: I need an actual tab to be included in the new lines to be added.
With GNU sed, you can use the a\ command to append lines after a match (or i\ to insert lines before a match).
sed '/one/a\ \tone-one\n\tone-two\n\tone-three' file
zero
one
one-one
one-two
one-three
two
three
four
The title states 'after the first occurrence' ('occurene' is presumably a typo), but the other answers don't seem to cater to this requirement, and because the sample set contains no repeats, the difference is not obvious when you test.
If we change the sample set to
zero
one
three
one
four
five
one
six
seven
one
Then we would need something like awk '/one/ && !x {print $0; print "\tone-one\n\tone-two\n\tone-three"; x=1;next} 1', which produces
zero
one
one-one
one-two
one-three
three
one
four
five
one
six
seven
one
Other answers to similar questions provide some more options as well.
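With GNU sed, the first-occurrence restriction can also be expressed with a 0,/regexp/ address range; a sketch using s so the appended lines (with their leading tabs) ride along in the replacement:

```shell
# GNU sed only: the range 0,/^one$/ runs from the start of input up to
# and including the FIRST line matching ^one$, so later matches are
# left untouched. & re-inserts the matched line itself.
printf 'zero\none\ntwo\none\n' |
  sed '0,/^one$/s/^one$/&\n\tone-one\n\tone-two\n\tone-three/'
```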
Using awk:
awk '1; /one/{print "\tone-one\n\tone-two\n\tone-three"}' file
zero
one
one-one
one-two
one-three
two
three
four

Sorting lines with numbers and word characters

I recently wrote a simple utility in Perl to count words in a file and determine each word's frequency, that is, how many times it appears.
It's all fine, but I'd like to sort the result to make it easier to read. An output example would be:
4:an
2:but
5:does
10:end
2:etc
2:for
As you can see, it's ordered by word, not frequency. But with a little help from :sort I could reorganize that. Using n, numbers like 10 go to the right place (even though they start with 1), and adding ! reverses the order, so the word that appears most often comes first.
:sort! n
10:end
5:does
4:an
2:for
2:etc
2:but
The problem is: when the number is repeated it gets sorted by word — which is nice — but remember, the order was reversed!
for -> etc -> but
How can I fix that? Will I have to use some Vim scripting to iterate over each line checking whether it starts with the previous number, and marking relevant lines to sort them after the number changes?
tac | sort -nr
does this, so select the lines with Shift+V and filter them through it with !
From the vim :help sort:
The details about sorting depend on the library function used. There is no
guarantee that sorting is "stable" or obeys the current locale. You will have
to try it out.
As a result, you might want to perform the sorting in your Perl script instead; it wouldn't be hard to extend your Perl sort to be stable, see perldoc sort for entirely too many details.
If you just want this problem finished, then you can replace your :sort command with this:
!sort -rn --stable (it might be easiest to use Shift-V to visually select the lines first, or use a range for the sort, or something similar, but if you're writing vim scripts, none of this will be news to you. :)
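If you want the exact order from the question (counts descending, ties broken alphabetically ascending) without relying on sort stability at all, you can sort on two explicit keys; a sketch assuming the count:word format shown above:

```shell
# -t: splits on the colon; -k1,1nr sorts the count numerically,
# descending; -k2,2 then sorts ties by word, ascending.
printf '4:an\n2:but\n5:does\n10:end\n2:etc\n2:for\n' |
  sort -t: -k1,1nr -k2,2
# → 10:end, 5:does, 4:an, 2:but, 2:etc, 2:for (one per line)
```

From Vim that would be, e.g., :%!sort -t: -k1,1nr -k2,2 on the whole buffer.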