Find position of the first occurrence of a substring in a file - shell

I have a very large file, which is made of only one line (no CR at all).
There are several occurrences of the same pattern (let's say here the pattern is ABCDE).
I want to return the starting position (the starting column) of the first character of the first occurrence of this pattern.
For example, if this is the data in the file:
123456ABCDEF456987ABCDEFjhkhkhkhABCDEF
I want to return 7 as the starting column of the first occurrence of the pattern.
thanks community :-)

Use awk's index() function:
awk -v pattern="ABCDE" '{print index($0,pattern)}' file
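For the sample data this prints 7; note that index() returns a 1-based position, and 0 if the pattern does not occur:
$ awk -v pattern="ABCDE" '{print index($0,pattern)}' file
7
If GNU grep is available, the byte offset of the first match is an alternative; -b reports 0-based offsets, so add 1 to get the column:
$ grep -ob 'ABCDE' file | head -n1
6:ABCDE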

Use the "C" option of "split", so there will be no need to repair the files afterwards.
-C, --line-bytes=SIZE
put at most SIZE bytes of lines per output file
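A hypothetical invocation (the size is illustrative), producing pieces part_aa, part_ab, ... of at most 100 MB each, keeping lines whole whenever a line fits within the limit:
split -C 100m file part_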

Related

Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, the second time with A00000_00_A_002, etc. The output needs to be written back to the same file. Each file contains only one code, but it appears multiple times.
After some digging I have found:
perl -pi -e 's/Q\d{4,5}_\d{2}_./$& . "_" . ++$A/ge' /users/documents/*.xml
but the issue is that the counter does not reset for each file.
That is, the output of the first file is, say, Q00390_01_A_1 to Q00390_01_A_7, while the second file is Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_2 in the second.
Does anyone have any idea how to edit the above code to make it do that? I'm a total newbie, so ideally an edit to what I have would be brilliant. Thanks
cd /users/documents/
for f in *.xml; do
    perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge' "$f"
done
This matches the string facs= and any character, then "Q" or "M" followed by four or five digits, an underscore, two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A zero-padded to four digits. Because perl is invoked once per file inside the loop, $A starts fresh for every file, which is what resets the numbering.
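If you'd rather avoid the shell loop, a single perl invocation can reset the counter at file boundaries. This is a sketch of the same substitution; the close ARGV if eof idiom resets perl's line counter $. for each new file, and $A is zeroed whenever a first line is seen:
perl -pi -e '$A = 0 if $. == 1; s/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge; close ARGV if eof' /users/documents/*.xml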

How to delete a series of positions within a file based on a list of numbers with Bash

I'm pretty new to Bash scripting and I have a problem to solve. I have a file that looks like this:
>atac
ATTGGCAATTAAATTCTTTT
>lipa
ATTACCAAGTAAATTCTTTT
.
.
.
where every even line has the same length as the others but can contain different characters. I need to remove, from each even line, a series of positions listed in a .txt file. That .txt file contains only a list of numbers, one per line, corresponding to the positions to be removed, and looks like this:
3
5
8
10
11
The expected output must keep all even lines the same length as each other, but with the positions listed in the .txt file deleted from each of them.
Any suggestions?
If the "position" in the txt file indicates always the index of the original string, this awk-oneliner will help you:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x=""}7' your.txt FS="" OFS="" file
>atac
ATGCATAATTCTTTT
>lipa
ATACAGAATTCTTTT
We mark (as "-") the deleted char so that you can verify if the result is correct:
awk 'NR==FNR{a[$0];next}FNR%2==0{for(x in a)$x="-"}7' txt FS="" OFS="" file
>atac
AT-G-CA-T--AATTCTTTT
>lipa
AT-A-CA-G--AATTCTTTT
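For readability, here is the first one-liner spread out with comments. Splitting records into one-character fields via FS="" is a gawk feature not guaranteed by POSIX, so gawk is assumed:
gawk '
  NR==FNR  { a[$0]; next }          # first file: remember each listed position
  FNR%2==0 { for (x in a) $x = "" } # even lines only: blank out each listed character
  { print }                         # print every line (the bare 7 above does the same)
' your.txt FS="" OFS="" file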

Replace a part of a file by a part of another file

I have two files containing a lot of floating-point numbers. I would like to replace one of the numbers in File 1 with a number from File 2, locating the numbers by line and character position (and not by their values).
There are a lot of topics on the subject, but I couldn't find anything that uses a second file to copy values from.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number on the second line of File1 (2.64895E-01) with the second floating-point number on line 5 of File2 (0.65728944).
Note: the values of the numbers change from file to file, so I have to identify the numbers by their positions inside the files.
I am very new to using bash scripts and have only used the "sed" command so far to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$2} NR>FNR {if (FNR==2) $1=val; print}' file2 file1
Explanation: read file2 first and store the second field of the 5th record in the variable val (the first part: NR==5 {val=$2}). Then read file1 and print every line, but replace the first field of the second record with the value stored in val. FNR is the record number within the current file, while NR is the total number of records across all files so far, so NR>FNR is true only while reading the second file.
In general, an awk program consists of pattern { actions } sequences, where pattern is a condition under which the actions get executed. $1..$NF are variables holding the field values, and each line (record) is split into fields on the field separator (the FS variable, or the -F'..' option), which defaults to whitespace.
The result (output):
14 4
0.53758275 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
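As a minimal standalone illustration of the pattern { actions } model described above (toy data, unrelated to the files here):
$ printf 'a 1\nb 2\nc 3\n' | awk '$2 > 1 { print $1 }'
b
c
The pattern $2 > 1 selects records whose second field exceeds 1; the action prints their first field.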

Parse CSV, group all rows containing a string in the 5th field, export each group of rows to a file with filename <group>_someconstant.csv

Need this in bash.
In a linux directory, I will have a CSV file. Arbitrarily, this file will have 6 rows.
Main_Export.csv
1,2,3,4,8100_group1,6,7,8
1,2,3,4,8100_group1,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
I need to parse this file's 5th field (first four chars only) and take each row with 8100 (for example) and put those rows in a new file. Same with all other groups that exist, across the entire file.
Each new file can only contain the rows for its group (one file with the rows for 8100, one file for the rows with 3100, etc.)
Each filename needs to have that group# prepended to it.
The first four characters could be any numeric value, so I can't check these against a list - there are like 50 groups, and maintenance can't be done on this if a group # changes.
When parsing the fifth field, I only care about the first four characters
So we'd start with: Main_Export.csv and end up with four files:
Main_Export_$date.csv (unchanged)
8100_filenameconstant_$date.csv
3100_filenameconstant_$date.csv
5400_filenameconstant_$date.csv
I'm not sure of the rules of the site, and whether I have to try this myself first before posting; I'll come back once I have an idea, but I'm at a total loss. Reading up on awk right now.
If I have understood your problem correctly, this is very easy...
You can just:
$ awk -F, '{fifth=substr($5, 1, 4) ; print > (fifth "_mysuffix.csv")}' file.csv
or just:
$ awk -F, '{print > (substr($5, 1, 4) "_mysuffix.csv")}' file.csv
And you will get several files like:
$ cat 3100_mysuffix.csv
1,2,3,4,3100_group2,6,7,8
1,2,3,4,3100_group2,6,7,8
or...
$ cat 5400_mysuffix.csv
1,2,3,4,5400_group3,6,7,8
1,2,3,4,5400_group3,6,7,8
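The question also asks for a date in the output names; passing one in with -v keeps the same one-liner shape (the filenameconstant part and the %F date format are placeholders for whatever you need):
$ awk -F, -v d="$(date +%F)" '{print > (substr($5, 1, 4) "_filenameconstant_" d ".csv")}' Main_Export.csv
With many distinct groups, awk implementations other than gawk may limit the number of simultaneously open files; calling close() on each file name after writing works around that.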

Extract a string conditionally from a text file with a variable number of columns

From a text file with a variable number of columns per row (tab-delimited), I would like to extract values matching a specific condition.
The text file looks like:
S1=dhs Sb=skf S3=ghw QS=ghr
S1=dhf QS=thg S3=eiq
QS=bhf S3=ruq Gq=qpq GW=tut
Sb=ruw QS=ooe Gq=qfj GW=uvd
I would like to have a result like:
QS=ghr
QS=thg
QS=bhf
QS=ooe
Please excuse my naive question but I am a beginner trying to learn some basic bash scripting technique for text manipulation.
Thanks in advance!
You could use awk:
awk '{for(i=1;i<=NF;i++){if($i~/^QS=/){print $i}}}' file
This awk command iterates over each field and checks whether the column starts with the string QS=. Any matching column is printed.
Through grep:
grep -oP '(^|\t)\KQS=\S*' file
-o means "only matching": print only the part of the line that matched.
-P enables Perl-compatible regex mode.
(^|\t) matches the start of a line or a tab character.
\K discards the previously matched tab or start-of-line anchor from the reported match.
QS= then matches the literal string QS=.
\S* matches zero or more non-space characters.
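Both commands print one match per line, which for the sample input above is exactly the requested result:
$ grep -oP '(^|\t)\KQS=\S*' file
QS=ghr
QS=thg
QS=bhf
QS=ooe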
