Bash simple pattern extractor on text

I'm stuck on a simple problem of finding a pattern in a string. I've never been comfortable with sed or regex in general.
I'm trying to get the number in the second column in one variable, and the number in the third column in another variable. The numbers are separated by tabs:
Here's what I have now :
while read line
do
middle="$(echo "$line" | sed 's/([0-9]+)\t\([0-9]+\)\t([0-9]+)\\.([0-9]+)/\1/')"
last="$(echo "$line" | sed 's/([0-9]+)\t([0-9]+)\t\([0-9]+)\\.([0-9]+\)/\1/')"
done
Here is the text :
11 1545 0.026666
12 1633 0.025444
13 1597 0.026424
14 1459 0.025634
I know there are simpler tools than 'sed', so feel free to put them to me in response.
Thanks.

This functionality is built into read.
while read first second third more; do
…
done
By default, read splits its input into whitespace-separated fields. Each variable receives one field, except the last one which receives whatever remains on the line. This matches your requirement provided there aren't any empty columns.
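A minimal sketch of that loop with the sample rows inlined, assuming the columns are tab-separated as described in the question. The variable names (data, middles, lasts, rest) are illustrative only; -r is added so backslashes in the input are kept literal.

```bash
#!/usr/bin/env bash
# Two sample rows from the question, assumed to be tab-separated.
data=$'11\t1545\t0.026666\n12\t1633\t0.025444'

middles=()
lasts=()
# read splits each line on whitespace (tabs included);
# "rest" absorbs any extra columns so "last" holds only the third field.
while read -r first middle last rest; do
  middles+=("$middle")
  lasts+=("$last")
  printf 'middle=%s last=%s\n' "$middle" "$last"
done <<< "$data"
```

Because the loop reads from a here-string rather than a pipe, it runs in the current shell and the variables remain set after done.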

Use AWK to save yourself:
while read line
do
middle="$(awk '{print $2}' <<< "$line")"
last="$(awk '{print $3}' <<< "$line")"
done

Related

Delete everything before a pattern

I'm trying to clean a text file.
I want to delete everything before the first 12-digit number on each line.
1:0:135103079189:0:0:2:0::135103079189:000011:00
A:908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Output desired:
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
Here's my command, but it doesn't seem to work.
sed '/:\([0-9]\{12\}\)/d' t.txt

The d command in sed deletes the entire line when the regex matches; you need the s command to search and replace only part of a line. Even so, sed is a poor fit for this problem because it doesn't support non-greedy matching.
You can use perl instead:
$ perl -pe's/^.*?(?=\d{12}:)//' ip.txt
135103079189:0:0:2:0::135103079189:000011:00
908529896240:0:10250:2:0:1:
603307102606:0:0:1:0::01000::M
.*? matches zero or more characters, as few as possible
(?=\d{12}:) succeeds only if the next characters are 12 digits followed by :
Use perl -i -pe for in-place editing.
Some possible corner cases:
$ # this is matching part of field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe's/^.*?(?=\d{12}:)//'
135103079189:23:603307102606:1
$ # this is not matching 12-digit field at end of line
$ echo 'foo:123:135103079189' | perl -pe's/^.*?(?=\d{12}:)//'
foo:123:135103079189
$ # so, add start/end of line matching cases and restrict 12-digits to whole field
$ echo 'foo:123:abc135103079189:23:603307102606:1' | perl -pe 's/^(?:.*?:)?(?=\d{12}(:|$))//'
603307102606:1
$ echo 'foo:123:135103079189' | perl -pe's/^(?:.*?:)?(?=\d{12}(:|$))//'
135103079189
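For comparison, the whole-field behaviour of the last perl commands can be approximated in plain bash. This is only a sketch: bash's =~ uses POSIX extended regexes, which have no lazy quantifier or lookahead, so it relies on the regex engine picking the leftmost match and captures the wanted suffix instead. The function name strip_to_first_id is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the line starting at its first field
# that consists of exactly 12 digits; print it unchanged otherwise.
strip_to_first_id() {
  local line=$1
  # (^|:)              the field begins at line start or right after a colon
  # ([0-9]{12}(:.*)?)  12 digits, then either end of line or ":" plus the rest
  local re='(^|:)([0-9]{12}(:.*)?)$'
  if [[ $line =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[2]}"
  else
    printf '%s\n' "$line"
  fi
}

strip_to_first_id '1:0:135103079189:0:0:2:0::135103079189:000011:00'
strip_to_first_id 'foo:123:abc135103079189:23:603307102606:1'
strip_to_first_id 'foo:123:135103079189'
```

The regex is kept in a variable so the shell does not reinterpret the parentheses and braces inside [[ ]].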

Could you please try the following:
awk --re-interval 'match($0,/[0-9]{12}/){print substr($0,RSTART)}' Input_file
I have an old version of awk, so I am using --re-interval to enable the {12} interval expression; you can remove it if you have a newer version.

This might work for you (GNU sed):
sed -n 's/[0-9]\{12\}/\n&/;s/.*\n//p' file
We only want to print specific lines so use the -n option to turn off automatic printing. If a line contains a 12 digit number, insert a newline before it. Remove any characters before and including a newline and print the result.
If you want to print lines that do not contain a 12 digit number as is, use:
sed 's/[0-9]\{12\}/\n&/;s/.*\n//' file
The crux of the problem is to identify the start of a multi-character string, insert a unique marker and delete all characters before and including the unique marker. As sed uses the newline to delimit lines, only the user can introduce newlines into the pattern space and as a result, newlines will always be unique.

Building on the nice answer from @Sundeep: if you would like to use grep, or pcregrep (macOS/BSD), you could try:
$ grep -oP '^(?:.*?:)?(?=\d{12})\K.*' file
or
$ pcregrep -o '^(?:.*?:)?(?=\d{12})\K.*' file
The \K discards everything matched before it, so only the text from that point onward is printed.

An alternative thought: your data almost looks too dirty for a quick sed fix, but if it is all similar to your sample, then certainly pick one of the sed-style answers. If you want to be more particular about it, you can build up a set of commands that check the values. I like doing this for debugging and when speed isn't urgent.
Take this tiny sample of code. There are other ways to do it, but here I get the value of each part of the string, and I know the order because the fields are contiguous. You could then add controls for which parts to keep as the loop builds, say, a new string per line. Overwrought for sure, but sometimes that is the better long-term approach.
#!/bin/bash
while IFS= read -r line ;do
IFS=':' read -r -a array <<< "$line"
for ((i=0; i<${#array[@]}; i++)) ;do
echo "part : ${array[$i]}"
done
done < "test_data.txt"
You could then build the data back up how you wanted and more easily understand what's happening every step of the way ..
part : 1
part : 0
part : 135103079189
part : 0
part : 0
part : 2
part : 0
part :
part : 135103079189
part : 000011
part : 00
part : A
part : 908529896240
part : 0
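One possible way to build the data back up from those parts, sketched under the assumption that the goal is still "keep everything from the first 12-digit field onward". The function name from_first_12_digit_field is invented for this example. The extra ":" appended before splitting is there because read drops a single trailing delimiter, which would otherwise lose the final colon on lines such as the second sample.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split on ":", find the first field that is
# exactly 12 digits, and re-join the fields from there on.
from_first_12_digit_field() {
  local line=$1 i
  local -a fields
  local IFS=':'                          # used for splitting and for re-joining
  read -r -a fields <<< "${line}:"       # extra ":" keeps a trailing empty field
  for ((i = 0; i < ${#fields[@]}; i++)); do
    if [[ ${fields[i]} =~ ^[0-9]{12}$ ]]; then
      printf '%s\n' "${fields[*]:i}"     # "[*]" joins with the first char of IFS
      return 0
    fi
  done
  printf '%s\n' "$line"                  # no 12-digit field: leave unchanged
}

from_first_12_digit_field '1:0:135103079189:0:0:2:0::135103079189:000011:00'
from_first_12_digit_field 'A:908529896240:0:10250:2:0:1:'
from_first_12_digit_field '603307102606:0:0:1:0::01000::M'
```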

How to loop a variable range in cut command

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command, to select a range of characters from another file. The range I want is the character at the position given by the value in the second column plus the next 10 characters. An example follows.
My files are something like that:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (just one very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range in a new line. So, it would always keep the range of 10, but in different start points and those start points are set by the values in the second column from the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument

There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr otherwise if you want to suppress informational logging).
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).

Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT

If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)
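A small self-contained check of the offset arithmetic, using only the first 80 characters of the question's sequence (written in groups of 10 so positions are easy to verify by eye). It assumes, as the desired output implies, that the numbers in file1.txt are 0-based offsets. For comparison it also shows the equivalent cut -c range, which is 1-based and inclusive, so the same slice is offset+1 through offset+10.

```bash
#!/usr/bin/env bash
# First 80 characters of the sample sequence, split across lines for readability.
data=GATCGAGCGG\
GATTCTTTTT\
TTTTAGGCGA\
GTCAGCTAGC\
ATCAGCTACG\
AGAGGCGAGG\
GCGGGCTATC\
ACGACTACGA

for offset in 10 25 48 66; do
  # Bash substring expansion: 0-based offset, then length.
  from_bash=${data:offset:10}
  # cut counts characters from 1 and takes an inclusive end position.
  from_cut=$(cut -c "$((offset + 1))-$((offset + 10))" <<< "$data")
  printf '%s %s\n' "$from_bash" "$from_cut"
done
```

Both columns of the output should be identical on every line, matching the four sequences in the desired result.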

One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is approachable for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts, but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
Thanks to @CharlesDuffy for improvements.
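The head/tail arithmetic can be tried in isolation with a short string. This is just a sketch; slice10 is a made-up helper name, and it reads from stdin instead of file2.txt so it can be run without any files.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print 10 bytes of stdin starting at 0-based offset $1.
slice10() {
  # head keeps the first offset+10 bytes; tail then keeps the last 10 of those.
  head -c "$(( $1 + 10 ))" | tail -c 10
}

# First 30 characters of the sample sequence.
printf '%s' 'GATCGAGCGGGATTCTTTTTTTTTAGGCGA' | slice10 10
echo
```

One caveat worth knowing: if the input is shorter than offset + 10 bytes, tail still returns the last 10 bytes of whatever head produced, so the slice silently starts earlier than requested.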

Sed creating duplicates

I have used the command sed in shell to remove everything except for numbers from my string.
Now, my string contains three 0s among other numbers and after running
sed 's/[^0-9]*//g'
Instead of three 0s, I now have 0, 01 and 02.
How can I prevent sed from doing that so that I can have the three 0s?
sample of the string:
0 cat
42 dog
24 fish
0 bird
0 tiger
5 fly

Now that we know that digits in filenames in the output from the du utility caused the problem (tip of the hat to Lars Fischer), simply use cut to extract only first column (which contains the data of interest, each file's/subdir.'s size in blocks):
du -a "$var" | cut -f1
du outputs tab-separated data, and a tab is also cut's default separator, so all that is needed is to ask for the 1st field (-f1).
In hindsight, your problem was unrelated to sed; your sample data simply wasn't representative of your actual data. It's always worth creating an MCVE (Minimal, Complete, and Verifiable Example) when asking a question.
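The difference is easy to reproduce without running du. The sketch below uses simulated du-style lines (a size, a tab, then a path); the path names cat01, dog and bird02 are invented here to stand in for file names that contain digits.

```bash
#!/usr/bin/env bash
# Simulated "du -a" output: size, a tab, then a path that may contain digits.
sample=$'0\tcat01\n42\tdog\n0\tbird02'

# Deleting every non-digit also keeps the digits from the paths.
sizes_sed=$(sed 's/[^0-9]*//g' <<< "$sample")

# Taking the first tab-separated field keeps only the sizes.
sizes_cut=$(cut -f1 <<< "$sample")

printf 'sed:\n%s\n' "$sizes_sed"
printf 'cut:\n%s\n' "$sizes_cut"
```

With this input the sed variant yields 001, 42 and 002, while cut -f1 yields 0, 42 and 0.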

Try this:
du -a "$var" | sed -e 's/[[:space:]].*//' -e 's/[^0-9]*//g'

Scripting username creation from text file?

I'm really new at Bash and scripting in general.
I have to create usernames formed of first letter of first name followed by last name. To do it, I use a provided text file that looks like this:
doe,john
smith,mike
...
I declared the following variables:
fname=$(cut -d, -f2 "file.txt" | cut -c1)
lname=$(cut -d, -f1 "file.txt")
But how do I put the elements together to form the names jdoe and msmith? I tried the methods I know to concatenate strings and variables, but nothing works.
I think I found a method using awk that is supposed to work, but is there any other way to "concatenate" the elements of 2 lists?
Thank you

There are a million ways to do it; this is the simplest:
$ awk -F, '{print substr($2,1,1) $1}' file
jdoe
msmith

Ed Morton's awk-based answer is simplest (and probably fastest), but since you asked for a different solution:
#!/usr/bin/env bash
while IFS=, read -r last first _; do
username=${first:0:1}${last}
echo "username: $username"
done < file.txt
IFS=, read -r last first _ reads the first 2 ,-separated fields from each input line (_ is a dummy variable that receives the rest of the input line, if any; -r prevents interpretation of \ chars. in the input, which is usually what you want).
username=${first:0:1}${last} concatenates the 1st char. of variable $first's value with variable $last's value, simply by placing the two variable references next to each other.
${first:0:1} - extract 1 character from $first at position 0 - is an example of parameter expansion, specifically: substring expansion
< file.txt is an input redirection that sends file.txt's contents via stdin to the while loop.
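The question also asks whether two lists can be "concatenated" element by element. One way to do that, sketched here as an alternative and not as part of the loop above, is paste with an empty delimiter (written '\0', which POSIX defines as the empty string for paste). The temporary file stands in for file.txt, and the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Temporary stand-in for file.txt with the two sample rows.
tmp=$(mktemp)
printf '%s\n' 'doe,john' 'smith,mike' > "$tmp"

# Left input: first letter of each first name. Right input: each last name.
# paste joins line N of both inputs; '\0' means "no delimiter between them".
usernames=$(paste -d '\0' \
  <(cut -d, -f2 "$tmp" | cut -c1) \
  <(cut -d, -f1 "$tmp"))

rm -f "$tmp"
printf '%s\n' "$usernames"
```

This reuses the two cut pipelines from the question unchanged; process substitution (the <( ) syntax) requires bash, not plain sh.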

This looks a bit too much like homework, so I'll just drop some hints.
To read the lastname and firstname into separate variables for each line of the file, see BashFAQ 1. It should not involve cut.
To grab the first character of a variable, see BashFAQ 100.

Use awk to extract value from a line

I have these two lines within a file:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
where I'd like to get the following as output using awk or sed:
3
50000
Using this sed command does not work as I had hoped, and I suspect this is due to the presence of the quotes and delimiters in my line entry.
sed -n '/WORD1/,/WORD2/p' /path/to/file
How can I extract the values I want from the file?

awk -F'[<>]' '{print $3}' input.txt
input.txt:
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
Output:
3
50000

sed -e 's/[a-zA-Z.<\/>= "-]//g' file

Using sed:
sed -E 's/.*limit"*>([0-9]+)<.*/\1/' file
Explanation:
.* takes care of everything that comes before the string limit
limit"* takes care of both the lines, one with limit" and the other one with just limit
([0-9]+) takes care of matching numbers and only numbers as stated in your requirement.
\1 is a backreference to the first capture group. When a pattern wraps all or part of its content in a pair of parentheses, it captures the text matched there and stores it temporarily, so it can be reused in the replacement. For more details, please refer to https://www.inkling.com/read/introducing-regular-expressions-michael-fitzgerald-1st/chapter-4/capturing-groups-and
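The same capture-group idea can be tried in bash itself, where the captured text ends up in BASH_REMATCH[1] instead of \1. This is a sketch for illustration, reusing the core of the pattern above on the two sample lines; the array name values is arbitrary.

```bash
#!/usr/bin/env bash
# Same core pattern as the sed command; the parentheses capture the digits.
re='limit"*>([0-9]+)<'

values=()
while IFS= read -r line; do
  if [[ $line =~ $re ]]; then
    # BASH_REMATCH[0] is the whole match, [1] the first capture group.
    values+=("${BASH_REMATCH[1]}")
  fi
done <<'EOF'
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
EOF

printf '%s\n' "${values[@]}"
```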

The script solution with parameter expansion:
#!/bin/bash
while read line || test -n "$line" ; do
value="${line%<*}"
printf "%s\n" "${value##*\>}"
done <"$1"
output:
$ ./ltags.sh dat/ltags.txt
3
50000
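To see what the two expansions do, it can help to run them on a single line and print the intermediate value. A minimal sketch (the variable name number is added here for illustration):

```bash
#!/usr/bin/env bash
line='<first-value system-property="unique.setting.limit">3</first-value>'

# % removes the shortest suffix matching "<*", i.e. the closing tag.
value=${line%<*}

# ## removes the longest prefix matching "*>", i.e. the whole opening tag.
number=${value##*>}

printf 'after %%:  %s\n' "$value"
printf 'after ##: %s\n' "$number"
```

The first expansion leaves the opening tag plus the value, and the second strips the tag, leaving just 3 for this line.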

Looks like XML to me, so assuming it forms part of some valid XML, e.g.
<root>
<first-value system-property="unique.setting.limit">3</first-value>
<second-value-limit>50000</second-value-limit>
</root>
You can use Perl's XML::Simple and do something like this:
perl -MXML::Simple -E '$xml = XMLin("file"); say $xml->{"first-value"}->{"content"}; say $xml->{"second-value-limit"}'
Output:
3
50000
If the XML structure is more complicated, then you may have to drill down a bit deeper to get to the values you want. If that's the case, you should edit the question to show the bigger picture.

Ashkan's awk solution is straightforward, but let me suggest a sed solution that accepts non-integer numbers:
sed -n 's/[^>]*>\([.[:digit:]]*\)<.*/\1/p' input.txt
This extracts the number between the first > character of the line and the following <. In my RE this "number" can be the empty string; if you don't want to accept an empty string, add the -r option to sed and replace \([.[:digit:]]*\) by ([.[:digit:]]+).
