extract string conditionally from variable column sized text file - bash

From a tab-delimited text file with a variable number of columns per row, I would like to extract the value that meets a specific condition.
The text file looks like:
S1=dhs Sb=skf S3=ghw QS=ghr
S1=dhf QS=thg S3=eiq
QS=bhf S3=ruq Gq=qpq GW=tut
Sb=ruw QS=ooe Gq=qfj GW=uvd
I would like to have a result like:
QS=ghr
QS=thg
QS=bhf
QS=ooe
Please excuse my naive question but I am a beginner trying to learn some basic bash scripting technique for text manipulation.
Thanks in advance!

You could use awk:
awk '{for(i=1;i<=NF;i++){if($i~/^QS=/){print $i}}}' file
This awk command iterates through each field and checks whether it starts with the string QS=. If it does, that field is printed.
With grep:
grep -oP '(^|\t)\KQS=\S*' file
-o prints only the part of each line that matched.
-P enables Perl-compatible regex mode (a GNU grep option).
(^|\t) matches the start of a line or a tab character.
\K discards the previously matched tab or start-of-line boundary from the reported match.
QS= matches the literal string QS=.
\S* matches zero or more non-space characters.
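As a quick check of the awk approach, here is a minimal sketch that recreates the question's sample as a temporary tab-delimited file and runs the same loop. The temporary file and the variable name qs_fields are only for illustration, and -F'\t' is an added assumption so that fields are split on tabs only.

```shell
# Hypothetical sample file reproducing the question's input (tab-delimited).
sample=$(mktemp)
printf 'S1=dhs\tSb=skf\tS3=ghw\tQS=ghr\n' > "$sample"
printf 'S1=dhf\tQS=thg\tS3=eiq\n' >> "$sample"
printf 'QS=bhf\tS3=ruq\tGq=qpq\tGW=tut\n' >> "$sample"
printf 'Sb=ruw\tQS=ooe\tGq=qfj\tGW=uvd\n' >> "$sample"

# Split on tabs only and print every field that starts with QS=.
qs_fields=$(awk -F'\t' '{ for (i = 1; i <= NF; i++) if ($i ~ /^QS=/) print $i }' "$sample")
printf '%s\n' "$qs_fields"

rm -f "$sample"
```

Splitting on tabs explicitly should keep a value that happens to contain a space in one field, which the default whitespace splitting would not.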

Related

Extract a substring (value of an HTML node tag) in a bash/zsh script

I'm trying to extract a tag value of an HTML node that I already have in a variable.
I'm currently using Zsh but I'm trying to make it work in Bash as well.
The current variable has the value:
<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>
and I would like to get the value of data-count (in this case 0, but could be any length integer).
I have tried using cut, sed and variable expansion as explained in this question, but I haven't managed to adapt the regexes, or maybe it has to be done differently for Zsh.
There is no reason why sed would not work in this situation. For your specific case, I would do something like this:
sed 's/.*data-count="\([0-9]*\)".*/\1/g' file_name.txt
Basically, sed looks for a pattern that contains data-count=, saves everything within the parentheses \(...\) into \1, and then prints \1 in place of the match (which is the full line, because of the .* on both sides).
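Since the question has the HTML in a variable rather than in a file, the same substitution can be fed from the variable. This is a sketch only: the function name data_count is made up for illustration, and -n with the p flag is an addition so that input without the attribute prints nothing instead of the unchanged line.

```shell
# data_count is a hypothetical helper name, not part of the original answer.
data_count() {
  # -n plus the trailing p prints only lines where the substitution happened.
  printf '%s\n' "$1" | sed -n 's/.*data-count="\([0-9]*\)".*/\1/p'
}

html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
data_count "$html"
```

This avoids shell-specific features, so it should behave the same way under Bash and Zsh.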
Could you please try the following:
awk 'match($0,/data-count=[^ ]*/){print substr($0,RSTART+12,RLENGTH-13)}' Input_file
Explanation: awk's match function tests the regex data-count=[^ ]*, i.e. everything from data-count= up to the next space. If a match is found, the built-in variables RSTART and RLENGTH are set to the start position and length of the match. substr then prints only the value: RSTART+12 skips the 12 characters of data-count=", and RLENGTH-13 also drops the closing quote.
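The offsets 12 and 13 are easy to get wrong, so here is a hedged variant that derives them from the prefix length instead of hard-coding them. The function name data_count_awk and the tighter regex (stopping at the closing quote rather than at a space) are assumptions for illustration.

```shell
# data_count_awk is a hypothetical helper; it reads the tag from its first argument.
data_count_awk() {
  printf '%s\n' "$1" | awk '
    BEGIN { prefix = "data-count=\"" }          # 12 characters long
    match($0, /data-count="[^"]*"/) {
      # Skip the prefix, and drop the closing quote from the length.
      print substr($0, RSTART + length(prefix), RLENGTH - length(prefix) - 1)
    }'
}

data_count_awk '<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
```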
With sed, could you please try the following:
sed 's/.*data-count=\"\([^"]*\).*/\1/' Input_file
Explanation: The capture group \([^"]*\) saves everything between data-count=" and the next double quote, which is the attribute value. Because this is an s (substitute) command and the pattern is wrapped in .*, the whole line is replaced with \1, the saved group.
As was said before, to be on the safe side and handle any syntactically valid HTML tag, a parser would be strongly advised. But if you know in advance what the general format of your HTML element will look like, the following hack might come in handy:
Assume that your variable is called "html"
html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'
First adapt it a bit:
htmlx="tag ${html%??}"
This will add the string tag in front and remove the final />
Now make an associative array:
declare -A fields
fields=( ${=$(tr = ' ' <<<$htmlx)} )
The tr turns each equals sign into a space, and the Zsh-specific ${= flag forces word splitting. You can now access the values of your attributes by, say,
echo $fields[data-count]
Note that this still has the surrounding double quotes. You can easily remove them with
echo ${${fields[data-count]%?}#?}
Of course, once you do this hack, you have access to all attributes in the same way.
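If only one attribute is needed, the variable-expansion route mentioned in the question can be sketched with plain prefix and suffix stripping. The names rest and count are illustrative, and the case guard is an added assumption to cover input that lacks the attribute.

```shell
html='<span class="alter" fill="#ffedf0" data-count="0" data-more="none"/>'

case $html in
  *'data-count="'*)
    rest=${html#*data-count=\"}   # drop everything up to and including data-count="
    count=${rest%%\"*}            # drop everything from the next double quote onward
    ;;
  *)
    count=                        # attribute not present
    ;;
esac

printf '%s\n' "$count"
```

These expansions are standard shell syntax, so this should work in both Bash and Zsh without spawning an external command.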

How do I create pattern of kmer in unix for a given string?

I have a string called mystring=AACTCGCTTT. I want to create patterns of this string allowing 4 mismatches (kmer = 6), starting from the first letter and ending at the last letter. I want this so I can grep these patterns in a text file. How do I do that in bash? So my patterns would look like this:
????CGCTTT
A????GCTTT
AA?T???TTT
There is a tool called agrep for that purpose:
agrep -4 AACTCGCTTT filename
From the man page:
Searches for approximate matches of PATTERN in each FILE or standard input. Example: 'agrep -2 optimize foo.txt' outputs all lines in file 'foo.txt' that match "optimize" within two errors. E.g. lines which contain "optimise", "optmise", and "opitmize" all match.
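If the explicit patterns from the question are wanted rather than agrep, one possible sketch is to enumerate every way of masking exactly 4 positions. The function name kmer_patterns is hypothetical, and the wildcard is written as . (any single character in a grep regex) instead of the ? used in the question. Note that these patterns keep positions fixed, so they only cover substitutions.

```shell
# kmer_patterns STRING K: print every pattern with exactly K positions replaced by "."
kmer_patterns() {
  awk -v s="$1" -v k="$2" 'BEGIN {
    n = length(s)
    # Each mask value encodes one choice of positions to wildcard.
    for (mask = 0; mask < 2^n; mask++) {
      bits = 0; pat = ""; m = mask
      for (i = 1; i <= n; i++) {
        if (m % 2 == 1) { bits++; pat = pat "." }
        else            { pat = pat substr(s, i, 1) }
        m = int(m / 2)
      }
      if (bits == k) print pat
    }
  }'
}

kmer_patterns AACTCGCTTT 4 | head -n 3
```

The output could be saved to a file and passed to grep -f. For a 10-letter string and 4 mismatches that is 210 patterns, so agrep is likely the simpler choice when it is available.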

Find position of the first occurrence of a substring in a file

I have a very large file which consists of only one line (no CR at all).
The same pattern occurs several times (let's say here the pattern is ABCDE).
I want to return the starting position, or starting column, of the first character of the first occurrence of this pattern.
For example, if this is the data in the file:
123456ABCDEF456987ABCDEFjhkhkhkhABCDEF
I want to return 7 as the starting column of the first occurrence of the pattern.
thanks community :-)
Use awk's index() function:
awk -v pattern="ABCDE" '{print index($0,pattern)}' file
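A small sketch of how this behaves on the question's sample data, written to a temporary file without a trailing newline. The variable names first_pos and missing_pos are illustrative only. index() is 1-based and returns 0 when the substring does not occur.

```shell
sample=$(mktemp)
# Single line, no trailing newline, as described in the question.
printf '%s' '123456ABCDEF456987ABCDEFjhkhkhkhABCDEF' > "$sample"

first_pos=$(awk -v pattern="ABCDE" '{ print index($0, pattern) }' "$sample")
missing_pos=$(awk -v pattern="XYZ" '{ print index($0, pattern) }' "$sample")

printf 'ABCDE starts at column %s\n' "$first_pos"
printf 'XYZ gives %s (not found)\n' "$missing_pos"

rm -f "$sample"
```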
Use the "C" option of "split", so there will be no need to repair the files afterwards.
-C, --line-bytes=SIZE
put at most SIZE bytes of lines per output file

Sed keep original indentation and camel-casing a variable

I have a simple sed script that dynamically replaces a bunch of lines in my application using a variable; the variable comes from a list of strings. My function works but does not keep the original indentation. The function deletes the line if it contains a certain string and replaces it with a completely new line; I could not do an in-place substitution because of certain syntax restrictions.
How do I keep my original indentation when the line is replaced?
Can I capitalize my variable and remove the underscore on the fly? That is, the title is a capitalized, underscore-free version of variableName. The list of items in the variable array is really long, so I am trying to do this in one shot.
Ex: I want report_type -> Report Type done mid-process.
Is there a better way to solve this with sed? Thanks for any input, much appreciated.
The sed function is as follows:
variableName=$1
sed -i "/name\=\"${variableName}\.name\" value\=model\.${variableName}\.name options\=\#lists\./c\\{\{\> \_dropdown title\=\"${variableName}\" required\=true name\=\"${variableName}\"\}\}" test
SAMPLE INPUT
{{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}
SAMPLE EXPECTED OUTPUT
{{> _dropdown title="Report Type" required=true name="report_type" value=model.report_type.name}}
sample input variable
report_type
Try this:
sed -E "s/^(\s+).*name\=\"(report_type)\.name\" value\=model\.report_type\.name options\=\#lists\..*$/\1\{\{\> \_dropdown title\=\"\2\" required\=true name\=\"\2\"\}\}/;T;s/\"(\w+)_(\w+)\"/\"\u\1 \u\2\"/g" input.txt > output.txt
I used "report_type" instead of ${variableName} for testing as an sed one-liner.
Please change back to ${variableName}.
Then go back to using -i (in addition to -E, which is for extended regex).
I am not sure whether I can do it without extended regex, let me know if that is necessary.
use s/// to replace fine tuned line
first capture group for the white space making the indentation
second capture group for the variable name
stop if that did not replace anything, T;
another s///
look for something consisting of only letters between "",
with a "_" between two parts,
seems safe enough because this step is only done on the already replaced line
replace by two parts, without "_"
\u uppercases the first letter of each part
Note:
Doing this on your sample input creates two very similar lines.
I assume that is intentional. Otherwise please provide desired output.
Using GNU sed version 4.2.1.
Interesting line of output:
{{> _dropdown title="Report Type" required=true name="Report Type"}}
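To leave the name attribute untouched, an alternative sketch is to build the title in the shell first and then do a single substitution that re-emits the captured indentation. The helper to_title, the variable names and the four-space indented sample line are assumptions for illustration. It also assumes the title contains no / or & characters, which would need escaping in the sed replacement.

```shell
# to_title is a hypothetical helper: report_type -> Report Type
to_title() {
  printf '%s\n' "$1" | awk -F'_' '{
    out = ""
    for (i = 1; i <= NF; i++) {
      word = toupper(substr($i, 1, 1)) substr($i, 2)
      out = out (i > 1 ? " " : "") word
    }
    print out
  }'
}

variableName=report_type
title=$(to_title "$variableName")

# Sample line from the question, indented by four spaces for the demonstration.
input='    {{> _select title="Report Type" required=true name="report_type.name" value=model.report_type.name options=#lists.report_type}}'

# \1 holds the leading whitespace, so the original indentation is preserved.
output=$(printf '%s\n' "$input" | sed -E "s/^([[:space:]]*)\{\{> _select .*name=\"${variableName}\.name\".*/\1{{> _dropdown title=\"${title}\" required=true name=\"${variableName}\" value=model.${variableName}.name}}/")
printf '%s\n' "$output"
```

Computing the title outside sed avoids relying on the \u escape, which is a GNU sed extension.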

Sed or String Replace command in Unix to change last First Character after Sequence to UpperCase

I basically have these xml files where I need to change the first alphabet after
Eg.
Result:
I tried: sed 's/<structure name=\"/\U\/g'
However, this changes the entire word to uppercase. Can someone help me out?
\U converts all of the following characters. You will need to use \u to convert only the next character.
Also, you will need to group them to ensure correct letter is converted:
sed 's/\(<structure name=\"\)\(.\)/\1\u\2/' xml-file
sed 's/<structure name=\"\(.\)/<structure name=\"\U\1/'
sed only changes the case of text in the replacement. We can use a capturing group so that only the first character after the sequence is converted to uppercase.
You can also use \E, which is the counterpart of \U: it stops the case conversion instead of starting it.
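The difference between the three escapes can be seen side by side in the sketch below. It assumes GNU sed, since \u, \U and \E are GNU extensions. The sample line and the variable names are made up because the question's XML was not preserved.

```shell
# Hypothetical input line; the original XML from the question is not available.
line='<structure name="customer">'

# \u: uppercase only the next character of the replacement.
first_only=$(printf '%s\n' "$line" | sed 's/\(<structure name="\)\(.\)/\1\u\2/')

# \U: uppercase everything that follows in the replacement (here the whole word).
whole_word=$(printf '%s\n' "$line" | sed 's/\(<structure name="\)\([a-z]*\)/\1\U\2/')

# \U ... \E: start converting, then stop before the rest of the line.
with_end=$(printf '%s\n' "$line" | sed 's/\(<structure name="\)\(.\)\(.*\)/\1\U\2\E\3/')

printf '%s\n' "$first_only" "$whole_word" "$with_end"
```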
