Bash, retrieving two sets of particular strings from across a text file - bash

Consider the example:
Feb 14 26:00:01 randomtext here mail from user10#mailbox.com more random text
Feb 15 25:08:82 randomtext random text mail from user8#mailbox.com more random text
Jan 20 26:23:89 randomtext iortest test test mail from user6#mailbox.com more random
Mar 15 18:23:01 randomtext here mail from user4#mailbox.com more random text
Jun 15 20:04:01 randomtext here mail from user10#mailbox.com more random text
Using BASH I am trying to retrieve the first part of the timestamp, for example '26' and '25', and the email of the user, for example 'user10#mailbox.com'.
The output would then roughly look like:
26 user10#mailbox.com
25 user8#mailbox.com
26 user6#mailbox.com
18 user4#mailbox.com
20 user10#mailbox.com
I have tried using:
cat myfile | grep -o '[0-9][0-9].*.com'
but it gives me excess text in the middle.
How would I go about retrieving just the two strings I need?

Use sed with capture groups to select the parts you want.
sed 's/^.* \([0-9][0-9]\):.* mail from \(.*#.*\.com\).*/\1 \2/' myfile
^ = beginning of line
.* = any sequence of characters followed by space
\([0-9][0-9]\): = 2 digits followed by a colon. The digits will be saved in capture group #1
.* mail from = any sequence up to a space followed by mail from and another space
\(.*#.*\.com\) = any sequence followed by # followed by any sequence up to .com. This will be saved in capture group #2
.* = any sequence; this will match the rest of the line
Everything this matches (the whole line) will be replaced by capture group #1, a space, and capture group #2.

Try
cat myfile | awk '{print $3, $8}' | sed 's/:[0-9][0-9]//g'
Disclaimer: my awk skills are rusty - there should be a way to do this solely in awk without resorting to sed.
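Following up on that disclaimer, here is one possible awk-only sketch (assuming the sample layout, with the timestamp in field 3 and the address in field 8):

```shell
# Awk-only sketch: split field 3 (the timestamp) on ":" and take the hour,
# then print it with field 8 (the address). Assumes the sample's field layout.
awk '{ split($3, t, ":"); print t[1], $8 }' myfile
```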

If all your email addresses have the .com domain, the previous answer using sed is perfect.
But if you can have different domains, it's better to adjust that sed:
sed 's/^.* \([0-9][0-9]\):.* mail from \(.*#.*\..*\)\ more.*/\1 \2/' file

With perl:
$ perl -lne '
print "$1 $2" if /^\w+\s+\d+\s+(\d+):\d+:\d+\s+.*?([-\w\.]+#\S+)/
' file.txt
Output:
26 user10#mailbox.com
25 user8#mailbox.com
26 user6#mailbox.com
18 user4#mailbox.com
20 user10#mailbox.com

How to write a shell script to swap columns in txt file?

I was trying to solve one of my old assignments and I am literally stuck on this one. Can anyone help me?
There is a file called "datafile". This file has names of some friends and their
ages. But unfortunately, the names are not in the correct format. They should be
lastname, firstname
But, by mistake they are firstname,lastname
The task of the problem is writing a shell script called fix_datafile
to correct the problem, and sort the names alphabetically. The corrected filename
is called datafile.fix .
Please make sure the original structure of the file is kept untouched.
The following is the sample of datafile.fix file:
#personal information
#******** Name ********* ***** age *****
Alexanderovich,Franklin 47
Amber,Christine 54
Applesum,Franky 33
Attaboal,Arman 18
Balad,George 38
Balad,Sam 19
Balsamic,Shery 22
Bojack,Steven 33
Chantell,Alex 60
Doyle,Jefry 45
Farland,Pamela 40
Handerman,jimmy 23
Kashman,Jenifer 25
Kasting,Ellen 33
Lorux,Allen 29
Mathis,Johny 26
Maxter,Jefry 31
Newton,Gerisha 40
Osama,Franklin 33
Osana,Gabriel 61
Oxnard,George 20
Palomar,Frank 24
Plomer,Susan 29
Poolank,John 31
Rochester,Benjami 40
Stanock,Verona 38
Tenesik,Gabriel 29
Whelsh,Elsa 21
If you can use awk (I suppose you can), then here is a script which does what you need:
#!/bin/bash
RESULT_FILE_NAME="datafile.new"
head -4 datafile.fix > "$RESULT_FILE_NAME"
tail -n +5 datafile.fix | awk -F"[, ]" '{ if (!$2) { print "" } else { print $2 "," $1, $3 } }' >> "$RESULT_FILE_NAME"
Passing -F"[, ]" allows awk to split columns both by , and by space, and all that remains is to print the columns in the needed format. The downsides are that we need an if statement to preserve empty lines, and the file header also has to be treated separately.
Another option is using sed:
cat datafile.fix | sed -E 's/([a-zA-Z]+),([a-zA-Z]+) ([0-9]+)/\2,\1 \3/g' > datafile.new
The downside is that it requires regex that is not as obvious as awk syntax.
awk -F'[, ]' '
!/^$/ && !/^#/ {
first=$1;
last=$2;
map[first][last]=$0
}
END {
PROCINFO["sorted_in"]="#ind_str_asc";
for (i in map) {
for (j in map[i])
{
print map[i][j]
}
}
}' namesfile > datafile.fix
One liner:
awk -F'[, ]' '!/^$/ && !/^#/ { first=$1;last=$2;map[first][last]=$0 } END { PROCINFO["sorted_in"]="#ind_str_asc";for (i in map) { for (j in map[i]) { print map[i][j] } } }' namesfile > datafile.fix
A solution completely in gawk.
Set the field separator to both , and space. Then ignore any lines that are empty or start with #. Set the first and last variables from the delimited fields, and then create a two-dimensional array called map, indexed by first and last name, with the value equal to the whole line. At the end, set the sort order to string-ascending indices and loop through the array, printing the names in order as requested.
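If your awk is not gawk (no true multidimensional arrays, no PROCINFO["sorted_in"]), a portable sketch of the original task is to let awk swap the names and an external sort do the ordering. This assumes a two-line comment header in datafile, single-word names, and "First,Last age" data lines:

```shell
# Portable sketch: keep the header, swap "First,Last" to "Last,First", sort the rest.
# Assumes a two-line comment header and single-word names.
{ head -2 datafile; tail -n +3 datafile | awk -F'[, ]' 'NF { print $2 "," $1, $3 }' | sort; } > datafile.fix
```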
Completely in bash:
re="^[[:space:]]*([^#]([[:space:]]|[[:alpha:]])+),(([[:space:]]|[[:alpha:]])*[[:alpha:]]) *([[:digit:]]+)"
while IFS= read -r line
do
if [[ ${line} =~ $re ]]
then
echo "${BASH_REMATCH[3]},${BASH_REMATCH[1]} ${BASH_REMATCH[5]}"
else
echo "${line}"
fi
done < names.txt
The core of this is to capture, using bash regex matching (the =~ operator of the [[ command), parenthesis groupings, and the BASH_REMATCH array, the name before the comma (([^#]([[:space:]]|[[:alpha:]])+)), the name after the comma ((([[:space:]]|[[:alpha:]])*[[:alpha:]])), and the age ( *([[:digit:]]+)).
The first-name regex is constructed so as to exclude comments, and the last-name regex is constructed to handle multiple spaces before the age without including them in the name. Commented lines, with or without leading spaces (^[[:space:]]*[^#]), and lines without a comma are passed through unchanged. Either first names or last names may have internal spaces.
Once the last name and first name are isolated, it is easy to print them in reverse order followed by the age (echo "${BASH_REMATCH[3]},${BASH_REMATCH[1]} ${BASH_REMATCH[5]}"). Note that the letter/space groupings also count as capture groups, which is why we skip groups 2 and 4.
I have tried using awk and sed.
Try if this works:
cat datafile.fix | sed 's/ /,/g' | awk -F "," '{print $2,$1,$3}' | sed 's/ /,/' | sed 's/^,//' | sort -u > datafile_new.fix

replace string in text file with random characters

So what I'm trying to do is this: I've been using keybr.com to sharpen my typing skills, and on this site you can "provide your own custom text." I've been taking chapters out of books to type, so it's a little more interesting than just typing groups of letters. Now I want to also insert numbers into the text. Specifically, between each word, have something like "393" and random sets smaller and larger than that example.
So I have saved a chapter of a book into a file in my home folder. Now I just need a command to search for spaces and insert a group of numbers plus a space, so a sentence would look like this: The 293 dog 328 is 102 black. 334 The... etc.
I have looked up linux commands through search engines and i've found out how to replace strings in text files with:
sed -i 's/original/new/g' file.txt
and how to generate random numbers with:
$ shuf -i MIN-MAX -n COUNT
I just cannot figure out how to write a one-line command that will put random numbers between each word. I'm still-a-searching, so thanks to anyone who takes the time to read my problem.
Perl to the rescue!
perl -pe 's/ /" " . (100 + int rand 900) . " "/ge' < input.txt > output.txt
-p reads the input line by line, after reading a line, it runs the code and prints the line to the output
s/// is similar to the substitution you know from sed
/g means global, i.e. it substitutes as many times as possible
/e means the replacement part is a code to run. In this case, the code generates a random number (100-999).
Given:
$ echo "$txt"
Here is some random words. Please
insert a number a space between each one.
Here is a simple awk to do that:
$ echo "$txt" | awk '{for (i=1;i<=NF;i++) printf "%s %d ", $i, rand()*100; print ""}'
Here 92 is 59 some 30 random 57 words. 74 Please 78
insert 43 a 33 number 77 a 10 space 78 between 83 each 76 one. 49
And here is roughly the same thing in pure Bash:
while read -r line; do
for word in $line; do
printf "%s %s " "$word" "$((1 + RANDOM % 100))"
done
echo
done < <(echo "$txt")
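The shuf command the asker found can also be wired in, at the cost of one process per word; this is a sketch assuming GNU shuf and an input file input.txt:

```shell
# Sketch with GNU shuf: append a random 100-999 number after every word.
# Spawns one shuf per word, so it is slow on large files.
while read -r line; do
  out=""
  for word in $line; do
    out="$out$word $(shuf -i 100-999 -n 1) "
  done
  printf '%s\n' "${out% }"
done < input.txt
```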

manipulate text using shell script?

How can I manipulate the text file using a shell script?
input
chr2:98602862-98725768
chr11:3100287-3228869
chr10:3588083-3693494
chr2:44976980-45108665
expected output
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Using sed you can write
$ sed 's/chr//; s/[:-]/ /g' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
Or maybe you could use awk
awk -F "chr|[-:]" '{print $2, $3, $4}' file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665
What it does
-F "chr|[-:]" sets the field separators to chr or : or -. Now you could print the different fields or columns.
You can also use another field separator such as -F '[^0-9]+', which makes anything other than digits act as a separator.
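A concrete version of that separator (note that $1 is empty because the line starts with non-digit characters, so the numbers land in fields 2 through 4):

```shell
# All-digit fields: any non-numeric run separates; the first field is empty here.
awk -F'[^0-9]+' '{ print $2, $3, $4 }' file
```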
If you don't care about a leading blank char:
$ tr -s -c '0-9\n' ' ' < file
2 98602862 98725768
11 3100287 3228869
10 3588083 3693494
2 44976980 45108665

How to use awk to generate fixed width columns with colored input?

I am taking the output from 'ls -l' and passing it through awk to reformat it.
This works:
list=$(ls --color=none -l | tail -n+2)
printf '%s' "$list" | awk '{printf "%-40s more stuff\n", $9}'
It produces something like:
env_profiles more stuff
ls_test.sh more stuff
saddfasfasfdfsafasdf more stuff
test more stuff
But with --color=always it produces:
env_profiles more stuff
ls_test.sh more stuff
saddfasfasfdfsafasdf more stuff
test more stuff
more stuff
"env_profiles" is a directory, "ls_test.sh" is an executable file, so they are both colored and end up with different alignment. Also there is an extra line.
EDIT: Modified answer based on Ed Morton's post. Gets rid of extra line, handles filenames with spaces:
ls --color=always -l | tail -n+2 | awk '
{
$1=$2=$3=$4=$5=$6=$7=$8=""
field = substr($0,9)
nameOnly = field
gsub(/\x1b[^m]+m/,"",nameOnly)
if( length(field) - length(nameOnly) >= 0 ) {
printf "%-*s more stuff\n", 40 + length(field) - length(nameOnly), field
}
}'
The field ($9) that contains your colored file names starts and ends with control characters to produce the color on your terminal, e.g. in this case foo is colored on the screen but bar is not:
$ cat o1
-rwxr-xr-x 1 emorton Domain Users 21591 Nov 12 2011 foo
-rwxr-xr-x 1 emorton Domain Users 21591 Nov 12 2011 bar
$ cat -v o1
-rwxr-xr-x 1 emorton Domain Users 21591 Nov 12 2011 ^[[01;32mfoo^[[0m
-rwxr-xr-x 1 emorton Domain Users 21591 Nov 12 2011 bar
so when you printf that field in awk and give it a field width of N characters, the color-producing strings are counted as part of the width, but since they are non-printing, the end result will not show them, and it'll look like the field is using less space than a field that did not contain those characters. Hope that makes sense.
It looks to me like the coloring strings always start with the character \x1b then some coloring instruction and end with m so try this:
$ awk '{
nameOnly = $NF
gsub(/\x1b[^m]+m/,"",nameOnly)
printf "<%-*s>\n", 10 + length($NF) - length(nameOnly), $NF
}' o1
<foo >
<bar >
Note that your approach of using a specific field only works if there are no spaces in the file names.
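If keeping the color is not essential, an alternative sketch sidesteps the width arithmetic entirely by stripping the escape sequences before formatting (this assumes GNU sed for the \x1b escape; alignment is restored but the color is lost):

```shell
# Strip ANSI color sequences first, then pad to a fixed width.
ls --color=always -l | tail -n +2 | sed 's/\x1b\[[0-9;]*m//g' | awk '{ printf "%-40s more stuff\n", $9 }'
```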
ls --color=always -l | tail -n+2 | awk '{count = gsub(/\x1b/, "\x1b"); if (count == 0) count += 40; else count += 50; printf "%-"count"s more stuff\n", $9}'
Explanation
gsub returns the number of substitutions made on a line. In this case we are substituting the escape character \x1b with itself, storing the number of times it appears.
If there were no escape sequences found (count == 0) we add 40 spaces of padding.
If there were escape sequences on the line (i.e. color), we add 40 spaces of padding plus an additional 10 spaces. And of course, count is already equal to the number of escapes.
I found that on my system, if a line has color, it requires 10 more spaces of padding, plus the number of escape characters, to match the uncolored lines. For example: normal padding = 40; there were 3 escape sequences on the line; padding should be 40 + 10 + 3 = 53. This may be different on your system and may require adjustment of the numbers.
Finally, we print the line, substituting count for the padding.

Doing multi-staged text manipulation on the command line?

I have a file with a bunch of text in it, separated by newlines:
ex.
"This is sentence 1.\n"
"This is sentence 2.\n"
"This is sentence 3. It has more characters then some other ones.\n"
"This is sentence 4. Again it also has a whole bunch of characters.\n"
I want to be able to use some set of command-line tools that will, for each line, count the number of characters in that line, and then, if there are more than X characters in that line, split on periods (".") and count the number of characters in each element of the split line.
ex. of final output, by line number:
1. 24
2. 24
3. 69: 20, 49 (i.e. "This is sentence 3" has 20 characters, "It has more characters then some other ones" has 49 characters)
wc only takes a file name as input, so I'm having trouble directing it to take in a text string to do the character count on.
head -n2 processed.txt | tr "." "\n" | xargs -0 -I line wc -m line
gives me the error: ": open: No such file or directory"
awk is perfect for this. The code below should get you started and you can work out the rest:
awk -F. '{print length($0),NF,length($1)}' yourfile
Output:
23 2 19
23 2 19
68 3 19
70 3 19
It uses a period as the field separator (-F.), prints the length of the whole line ($0), the number of fields (NF), and the length of the first field ($1).
Here is another little example that prints the whole line and the length of each field:
awk -F. '{print $0;for(i=0;i<NF;i++)print length($i)}' yourfile
"This is sentence 1.\n"
23
19
"This is sentence 2.\n"
23
19
"This is sentence 3. It has more characters then some other ones.\n"
68
19
44
"This is sentence 4. Again it also has a whole bunch of characters.\n"
70
19
46
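Working out the rest of the original spec (line number, total length, and per-sentence lengths only when a line exceeds some threshold X, assumed here to be 30) might look like this sketch:

```shell
# Sketch: print "N. total" per line; for lines longer than X, also print the
# length of each ". "-separated sentence. X=30 is an assumed cutoff.
awk -v X=30 '{
  printf "%d. %d", NR, length($0)
  if (length($0) > X) {
    n = split($0, s, /\. /)
    printf ":"
    for (i = 1; i <= n; i++) printf " %d", length(s[i])
  }
  print ""
}' yourfile
```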
By the way, "wc" can process strings sent to its stdin like this:
echo -n "Hello" | wc -c
5
How about:
head -n2 processed.txt | tr "." "\n" | wc -m
You should understand better what xargs does and how pipes work. Do google for a good tutorial on those before using them =).
xargs passes each line separately to the next utility. This is not what you want: you want wc to get all the lines here. So just pipe the entire output of tr to it.
