Bash: add multiple lines after the first occurrence of a matched pattern in the file - bash

How to achieve this with awk/sed?
Input:
zero
one
two
three
four
Output:
zero
one
one-one
one-two
one-three
two
three
four
Note: I need an actual tab character to be included in the new lines.

With GNU sed, you can use the a\ command to append lines after a match (or the i\ command to insert lines before a match):
sed '/one/a\\tone-one\n\tone-two\n\tone-three' file
zero
one
one-one
one-two
one-three
two
three
four
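The answer mentions i\ as well; a minimal sketch inserting a single tab-indented line before the match (the text before-one is just a hypothetical placeholder):
sed '/one/i\\tbefore-one' file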

The title asks for insertion after the first occurrence only; however, the other answers don't seem to cater to this requirement, and because the sample input contains the pattern only once, the difference is not obvious when you test.
If we change the sample set to
zero
one
three
one
four
five
one
six
seven
one
Then we would need something like
awk '/one/ && !x {print $0; print "\tone-one\n\tone-two\n\tone-three"; x=1; next} 1' file
which produces
zero
one
one-one
one-two
one-three
three
one
four
five
one
six
seven
one
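If you prefer to stay in sed for the first-occurrence case, GNU sed's 0,/re/ address range (a GNU extension) limits a command to everything up to and including the first matching line. A sketch using s// (an empty pattern reuses the address's regex) to append the new lines after that first match only:
sed '0,/one/s//&\n\tone-one\n\tone-two\n\tone-three/' file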

Using awk:
awk '1; /one/{print "\tone-one\n\tone-two\n\tone-three"}' file
zero
one
one-one
one-two
one-three
two
three
four

Related

How to print unique values in order of appearance?

I'm trying to get the unique values
from the list below, but leaving the
unique values in the original order.
That is, the order of appearance.
group
swamp
group
hands
swamp
pipes
group
bellyful
pipes
swamp
emotion
swamp
pipes
bellyful
after
bellyful
I've tried combining the sort and uniq commands, but the output is sorted alphabetically, and if I don't use sort, uniq doesn't work (it only collapses adjacent duplicates).
$ sort file | uniq
after
bellyful
emotion
group
hands
pipes
swamp
and my desired output would be like this
group
swamp
hands
pipes
bellyful
emotion
after
How can I do this?
A short, jam-packed awk invocation will get the job done. We'll create an associative array and count every time we've seen a word:
$ awk '!count[$0]++' file
group
swamp
hands
pipes
bellyful
emotion
after
Explanation:
Awk processes the file one line at a time and $0 is the current line.
count is an associative array mapping lines to the number of times we've seen them. Awk doesn't mind us accessing uninitialized variables. It automatically makes count an array and sets the elements to 0 when we first access them.
We increment the count each time we see a particular line.
We want the overall expression to evaluate to true the first time we see a word, and false every successive time. When it's true, the line is printed. When it's false, the line is ignored. The first time we see a word count[$0] is 0, and we negate it to !0 == 1. If we see the word again count[$0] is positive, and negating that gives 0.
Why does true mean the line is printed? The general syntax we're using is expr { actions; }. When the expression is true the actions are taken. But the actions can be omitted; the default action if we don't write one is { print; }.
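Written out in full, the one-liner behaves like this equivalent sketch:
awk '{ if (count[$0] == 0) print $0; count[$0]++ }' file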

Removing blank space at the start of a line (the amount of blank space is not constant)

I am a beginner with sed. I am trying to use it to trim down a uniq -c result, removing the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input: the pattern matches starting at the first number (I have tested it with a simple substitution), but the leading blank space is left in place because it is not part of the match.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trim will not work, as some lines contain double/triple digit numbers and so do not have the same amount of blank space before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks, so there's no way around stripping them yourself.
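That said, for the bonus points you can strip the padding and produce a count-TAB-text .tsv in one substitution; a sketch over the already-counted file (GNU sed is assumed, because \t in the replacement is a GNU extension):
sed 's/^ *\([0-9][0-9]*\) /\1\t/' comp-c.csv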

Grabbing and splitting specific lines with one or more instances

Given a .txt file (DNA-sequence alignment report), in this format:
5463784 reads; of these:
5463784 (100.00%) were paired; of these:
841569 (15.40%) aligned concordantly 0 times
4469608 (81.80%) aligned concordantly exactly 1 time
152607 (2.79%) aligned concordantly >1 times
----
841569 pairs aligned 0 times concordantly or discordantly; of these:
1683138 mates make up the pairs; of these:
1407028 (83.60%) aligned 0 times
226521 (13.46%) aligned exactly 1 time
49589 (2.95%) aligned >1 times
87.12% overall alignment rate
What is the easiest and shortest way to grab sub-portions of specific lines? For example, if I want to grab the 'exactly' lines, I can use:
awk '/exactly/{print}'
Which would return:
4469608 (81.80%) aligned concordantly exactly 1 time
226521 (13.46%) aligned exactly 1 time
But I'm not sure how to then split what's returned to obtain 4469608 and 226521 in an array (to eventually sum together into a variable set to 4696129).
awk '/exactly/ { sum += $1 } END { print sum }' dna
This takes action on the lines where exactly is present: it adds the value of the first column to an awk variable called sum, and prints the total at the end.
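If you do want the individual values collected in an awk array first, as the question asks, a sketch along the same lines (vals and n are arbitrary names):
awk '/exactly/ { vals[++n] = $1 } END { for (i = 1; i <= n; i++) sum += vals[i]; print sum }' dna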

shell: What is meant by "number of sentences"?

I need to count the number of sentences and paragraphs in a text file, but I do not understand how to do this.
I can count the number of lines and words using the wc command, but I do not understand what a sentence or a paragraph means in a text file. Is there any shell command to do this?
Here's how we count the number of words and lines in a text file:
wc -w filename
wc -l filename
For sentences and paragraphs, here is what I tried:
grep -c \\. #to count number of sentences.
grep -o [.'\n'] #to count number of paragraph.
I do not understand how to count the number of sentences and paragraphs in a text file.
Any ideas would be helpful.
For example:
Main article: SSID#Security of SSID hiding.
A simple but ineffective method to attempt to secure a wireless network is to hide the SSID (Service Set Identifier).[12][13] This provides very little protection against anything but the most casual intrusion efforts...
That's 2 paragraphs and 3 sentences.
A first approximation can be obtained under the assumptions that:
Sentences end with a period, and periods are only used for that (no decimal numbers, no ellipsis, etc.)
Paragraphs are separated by exactly one empty line
(Of course these assumptions are not met by real text, but they should get you started.)
grep -o '\.' file | wc -l
will count the number of sentences (grep -c counts matching lines rather than matches, so -o piped into wc -l is needed when one line holds several sentences), and
grep -c "^$" file
will count the number of empty lines, i.e. paragraph separators; add one for the final paragraph. If your text is strictly formatted you may get something that works; otherwise, you could consider Natural Language Processing tools such as NLTK.
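Under the same assumptions, awk's paragraph mode gets both counts in one pass: setting RS to the empty string makes each blank-line-separated block a single record. A sketch (gsub returns the number of substitutions, i.e. the number of periods):
awk 'BEGIN { RS = "" } { paras++; sents += gsub(/\./, "") } END { print sents " sentences, " paras " paragraphs" }' file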
To count the number of sentences, you could count the number of periods, question marks, and exclamation points. But then you run into the problem of an ellipsis (...). I suppose you could only count it if it has whitespace afterwards.
Paragraphs are another matter. Are they indented? How, with a tab? Then count them.
The big question is 'What is the delimiter between sentences and paragraphs?'
When you know that, define the delimiter regex, and count how many are in the file using the tool of your choice.
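Following that idea, a sketch with GNU grep that counts runs of sentence-ending punctuation only when followed by whitespace or end of line, so an ellipsis is counted once:
grep -oE '[.!?]+([[:space:]]|$)' file | wc -l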

Which data structure is suitable?

Two files, each terabytes in size. A file comparison tool compares the i-th line of file1 with the i-th line of file2; if they are the same, it prints the line. Which data structure is suitable?
B-tree
linked list
hash tables
none of them
You need to be able to buffer up at LEAST a line at a time. Here's one way:
While neither file is at EOF:
Read lines A and B from files one and two (each)
If lines are identical, print one of them
Translate this into a suitable programming language, and the problem is solved.
Note that no fancy data structures are involved.
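In shell, the pseudocode above translates almost directly; a sketch reading both files in lockstep on separate file descriptors:
while IFS= read -r a <&3 && IFS= read -r b <&4; do
    [ "$a" = "$b" ] && printf '%s\n' "$a"
done 3<file1 4<file2
For terabyte-sized files a compiled loop would be far faster, but the structure is the same.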
The simple logic is to read one line at a time from each file and compare. It's like:
While line1 of file1 and line2 of file2 are not at EOF:
Compare line1 and line2
By the way, you have to know the maximum number of characters a line can contain, so you can size your buffer accordingly.
Otherwise, try a big-data framework such as Spark to make the work easier.
