Import of SQL Dump - bash

From the extract of data below, I am trying to view only the items in bold; is there a way to do this? The specific section of the file matters, which is why the 'INTO .db. VALUES' anchor is important. Also assume I do NOT know the actual values of these items.
grep "INTO .db. VALUES" ./anikokiss_mysql4.sql
INSERT INTO db VALUES ('localhost','blog','yscr_bbYcqN','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog1','yscr_bbS4kf','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbhrSZ','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbBl0C','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbrsKX','Y','Y','Y','Y','Y','N','N','N','N','N');
The result of
cut -d',' -f2 <(awk '/.*INTO .?db.? VALUES.*/,/.*;/{if (NR > 2) nr[NR]}; NR in nr' ./anikokiss_mysql4.sql)
is
'blog'
As such, it appears to pick only the first batch of comma-separated values and ignore those after it.
Similarly, with the grep command
grep "INTO .db. VALUES" ./anikokiss_mysql4.sql | cut -d "'" -f4
We get the result
blog
When I was updating DB names, I used:
find . -name '*.sql' -exec sed -i "s/\\(CREATE DATABASE [^\`]*\`\\)/\\1${cpuser}_/" {} +
Current output:
[root@uk01 public_html]# grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' aniko.sql | grep -Pao '\(([^,]*,){2}\K[^,]*' | sed -e 's/^/test_/'
test_'yscr_bbYcqN'
test_'yscr_bbS4kf'
test_'yscr_bbhrSZ'
test_'yscr_bbBl0C'
test_'yscr_bbrsKX'

This uses grep to achieve what you want:
grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' ./anikokiss_mysql4.sql | grep -Pao '\([^,]*,\K[^,]*'
First I use grep to get everything after INTO db VALUES. Then I pass that output to a second instance of grep, which prints the field after the first ,.
To change which column you get, adjust the repetition count {N} of the group before \K in the last grep:
# This gets the first column
grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' file | grep -Pao '\(([^,]*,){0}\K[^,]*'
# This gets the fourth column
grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' file | grep -Pao '\(([^,]*,){3}\K[^,]*'
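If the PCRE \K trick feels opaque, the same column extraction can be done with a plainer pipeline: split each parenthesised tuple onto its own line with tr, then cut the quoted field you want. A minimal sketch, assuming the INSERT statement sits on a single line as in the dump above (-f4 is the second column, -f6 the third; -s drops chunks with no quotes, such as the trailing ;):
grep "INTO .db. VALUES" ./anikokiss_mysql4.sql | tr ')' '\n' | cut -s -d"'" -f4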

Related

Adjustment to a grep / sed output

I'm trying to edit a grep/sed output; currently it is as follows:
grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' aniko.sql | grep -Pao '\(([^,]*,){2}\K[^,]*'
'yscr_bbYcqN'
'yscr_bbS4kf'
'yscr_bbhrSZ'
'yscr_bbBl0C'
'yscr_bbrsKX'
I then want to add a test_ prefix to them all; however, I can only get it to amend one of them. So:
#!/bin/sh
read -p "enter the cPanel username: " cpuser
cd "/home/$cpuser/public_html"
dbusers=($(grep -Pzo '(?s)INTO .?db.? VALUES[^(]\K[^;]*' aniko.sql | grep -Pao '\(([^,]*,){2}\K[^,]*' | grep -Pzo [^\']));
for dbuser in $dbusers; do sed -i "s/$dbuser/${cpuser}\_${dbuser}/g" aniko.sql; done;
The result of this only updates one of the lines (all should show the sensationalhosti_ prefix):
sensationalhosti_yscr_bbYcqN
yscr_bbS4kf
yscr_bbhrSZ
yscr_bbBl0C
yscr_bbrsKX
The format of the file I am changing is as follows:
INSERT INTO `db` VALUES ('localhost','blog','sensationalhosti_yscr_bbYcqN','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog1','yscr_bbS4kf','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbhrSZ','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbBl0C','Y','Y','Y','Y','Y','N','N','N','N','N'),('localhost','blog','yscr_bbrsKX','Y','Y','Y','Y','Y','N','N','N','N','N');
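A likely cause, for what it's worth: #!/bin/sh may not support arrays at all, and even under bash the unquoted $dbusers expands to only the first array element, so the loop body runs exactly once. A minimal sketch of a fix, assuming bash and the single-line INSERT layout shown above:
#!/bin/bash
read -p "enter the cPanel username: " cpuser
cd "/home/$cpuser/public_html" || exit 1
# collect the third quoted field of every tuple, one array element per DB user
mapfile -t dbusers < <(grep -Po "INTO .db. VALUES\K[^;]*" aniko.sql | grep -Po "\('[^']*','[^']*','\K[^']*")
# iterate over ALL elements, not just the first
for dbuser in "${dbusers[@]}"; do sed -i "s/$dbuser/${cpuser}_${dbuser}/g" aniko.sql; done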

Finding missing IDs

I have the two files below, each containing one ID per line. However, one of the files contains two fewer IDs.
$ grep ">" output.racon-1.fasta | wc -l
6492
$ grep ">" output.racon-2.fasta | wc -l
6490
How can I find out which two IDs are missing?
FILE 1
$ grep ">" output.racon-1.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-1.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
FILE 2
$ grep ">" output.racon-2.fasta | head
>utg000001l
>utg000002l
>utg000003l
>utg000004l
>utg000005l
>utg000006l
>utg000007l
>utg000008l
>utg000009l
>utg000010l
$ grep ">" output.racon-2.fasta | tail
>utg006483l
>utg006484l
>utg006485l
>utg006486l
>utg006487l
>utg006488l
>utg006489l
>utg006490l
>utg006491l
>utg006492l
Thank you in advance,
A simple diff with sort could do the job:
diff <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)
As an alternative to diff, consider using join. If the files are sorted, it can tell you: with no options, the lines the files have in common; with -v1, the lines present only in the first file; with -v2, the lines present only in the second file.
So, in your instance, if you believe that the second file is a subset of the first file, you could retrieve the additional lines in the first file with
join -v1 <(grep ">" output.racon-1.fasta) <(grep ">" output.racon-2.fasta)
or (if the files are not sorted already)
join -v1 <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)
[We're using process substitution (the <(...) expressions) to feed the results of your grep commands to join.]
Note, however, that if the second file is not a subset of the first, you'll want either to examine the output of the equivalent -v2 command or to take the information from diff.
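comm offers a third route when both inputs are sorted; -3 suppresses the common column, leaving lines unique to the first file in column one and lines unique to the second in column two:
comm -3 <(grep ">" output.racon-1.fasta | sort) <(grep ">" output.racon-2.fasta | sort)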

How to use awk to select text from a file starting from a line number until a certain string

I have a file that I want to read starting from a certain line number until a certain string. I already used
awk "NR>=$LINE && NR<=$((LINE + 121)) {print}" db_000022_model1.dlg
to read from a specific line until an incremented line number, but now I need it to stop by itself at a certain string so that I can use it on other files.
DOCKED: ENDBRANCH 7 22
DOCKED: TORSDOF 3
DOCKED: TER
DOCKED: ENDMDL
I want it to stop after it reaches
DOCKED: ENDMDL
#!/bin/bash
# This script is for extracting the pdb files from a sorted list of scored
# ligands
mkdir top_poses
for d in $(head -20 summary_2.0.sort | cut -d, -f1 | cut -d/ -f1)
do
cd "$d"||continue
# find the cluster with the highest population within the dlg
RUN=$(grep '###*' "$d.dlg" | sort -k10 -r | head -1 | cut -d\| -f3 | sed 's/ //g')
LINE=$(grep -ni "BEGINNING GENETIC ALGORITHM DOCKING $RUN of 100" "$d.dlg" | cut -d: -f1)
echo "$LINE"
# extract the best pose and correct the format
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
# convert the pdbqt file into pdb
#obabel -ipdbqt $d.pdbqt -opdb -O../top_poses/$d.pdb
cd ..
done
When I try the
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
just like that in the shell terminal, it works. But in the script it outputs an empty file.
Depending on your requirements for handling DOCKED: ENDMDL occurring before your target line:
awk -v line="$LINE" 'NR>=line; /DOCKED: ENDMDL/{exit}' db_000022_model1.dlg
or:
awk -v line="$LINE" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' db_000022_model1.dlg
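The practical difference: the first form exits at the first DOCKED: ENDMDL anywhere in the file, even above $LINE (printing nothing in that case), while the second only tests lines it has already started printing. A quick sketch to see this, assuming a scratch file:
printf 'a\nDOCKED: ENDMDL\nc\nd\ne\n' > /tmp/toy.txt
awk -v line=4 'NR>=line; /DOCKED: ENDMDL/{exit}' /tmp/toy.txt            # prints nothing
awk -v line=4 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' /tmp/toy.txt # prints d and e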

Find unique URLs in a file

Situation
I have many URLs in a file, and I need to find out how many unique URLs exist.
I would like to run either a bash script or a command.
myfile.log
/home/myfiles/www/wp-content/als/xm-sf0ab5df9c1262f2130a9b313192deca4-f0ab5df9c1262f2130a9b313192deca4-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,18,17
/home/myfiles/www/wp-content/als/xm-s4bf050d47df5bfaf0486a50a8528cb16-4bf050d47df5bfaf0486a50a8528cb16-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,15,14
/home/myfiles/www/wp-content/als/xm-sad122bf22152ba4823a520cc2fe59f40-ad122bf22152ba4823a520cc2fe59f40-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,17,16
/home/myfiles/www/wp-content/als/xm-s3c0f031eebceb0fd5c4334ecef15292d-3c0f031eebceb0fd5c4334ecef15292d-c23c5fbca96e8d641d148bac41017635|https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt,12,11
/home/myfiles/www/wp-content/als/xm-sff661e8c3b4f94957926d5434d0ad549-ff661e8c3b4f94957926d5434d0ad549-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s32c41ec2a5440ad220008b9abfe9add2-32c41ec2a5440ad220008b9abfe9add2-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s28787ca2f4372ddb3616d3fd53c161ab-28787ca2f4372ddb3616d3fd53c161ab-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,22,21
/home/myfiles/www/wp-content/als/xm-s89a7b68158e38391da9f0de1e636c0d5-89a7b68158e38391da9f0de1e636c0d5-c23c5fbca96e8d641d148bac41017635|https://quality.gha.org/Portals/2/documents/HEN/Meetings/nursesinstitute/062013/nursesroleineliminatingharm_moddydunning.pptx,13,12
/home/myfiles/www/wp-content/als/xm-sc4b14e10f6151995f21334061ff1d139-c4b14e10f6151995f21334061ff1d139-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,13,12
/home/myfiles/www/wp-content/als/xm-se589d47d163e43fa0c0d68e824e2c286-e589d47d163e43fa0c0d68e824e2c286-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s52f897a623c539d09bfb988bfb153888-52f897a623c539d09bfb988bfb153888-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,14,13
/home/myfiles/www/wp-content/als/xm-sccf27a904c5b88e96a3522b2e1180fed-ccf27a904c5b88e96a3522b2e1180fed-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,18,17
/home/myfiles/www/wp-content/als/xm-s6874bf9d589708764dab754e5af06ddf-6874bf9d589708764dab754e5af06ddf-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,17,16
/home/myfiles/www/wp-content/als/xm-s46c55ec8387dbdedd7a83b3ad541cdc1-46c55ec8387dbdedd7a83b3ad541cdc1-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hy-wire-car-2.pptx,19,18
/home/myfiles/www/wp-content/als/xm-s08cfdc15f5935b947bbaa93c7193d496-08cfdc15f5935b947bbaa93c7193d496-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,9,8
/home/myfiles/www/wp-content/als/xm-s86e267bd359c12de262c0279cee0c941-86e267bd359c12de262c0279cee0c941-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,15,14
/home/myfiles/www/wp-content/als/xm-s5aa60354d134b87842918d760ec8bc30-5aa60354d134b87842918d760ec8bc30-c23c5fbca96e8d641d148bac41017635|https://royalmechanical.files.wordpress.com/2011/06/hydro-power-plant.ppt,14,13
Desired Result:
Unique Urls: 4
cut -d "|" -f 2 file | cut -d "," -f 1 | sort -u | wc -l
Output:
4
See: man cut, man sort
An awk solution would be
awk '{sub(/^[^|]*\|/,"");gsub(/,[^,]*/,"");i+=a[$0]++?0:1}END{print i}' file
4
If you happen to use GNU awk then below would also give you the same result
awk '{i+=a[gensub(/.*(http[^,]*).*/,"\\1",1)]++?0:1}END{print i}' file
4
Or, even shorter, as pointed out in this cracker of a comment by @cyrus
awk -F '[|,]' '{i+=!a[$2]++} END{print i}' file
4
which uses awk's multiple-field-separator functionality and is more idiomatic awk.
Note: see the awk manual for more info.
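To see what that split produces on the first sample line above: with -F '[|,]' both the pipe and the commas act as separators, so $1 is the path, $2 the URL, and $3/$4 the trailing counts.
awk -F '[|,]' '{print $2}' myfile.log | head -1
https://public.rgfl.org/HS/PowerPoint%20Presentations/Health%20and%20Safety%20Law.ppt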
Parse with sed, and since the file appears to be already sorted
(with respect to URLs), just run uniq and count:
echo Unique URLs: $(sed 's/^.*|\([^,]*\),.*$/\1/' file | uniq | wc -l)
Use GNU grep to extract URLs:
echo Unique URLs: $(grep -o 'ht[^|,]*' file | uniq | wc -l)
Output (either method):
Unique URLs: 4
tr , '|' < myfile.log | sort -u -t '|' -k 2,2 | wc -l
tr , '|' < myfile.log translates all commas into pipe characters
sort -u -t '|' -k 2,2 sorts unique (-u), pipe delimited (-t '|'), in the second field only (-k 2,2)
wc -l counts the unique lines

Comparing output from two greps

I have two C source files with lots of defines and I want to compare them to each other and filter out lines that do not match.
The grep (grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440) output of the first file may look like:
...
...
# if !defined(NO_BCM_5675_A0)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
...
...
where grep (grep "define NO_BCM" include/sdk_custom_config.h) of the second looks like:
...
...
#define NO_BCM_56260_B0
#define NO_BCM_5675_A0
#define NO_BCM_56160_A0
...
...
So now I want to find any type numbers in the parentheses above that are missing from the #defines below. How do I best go about this?
Thank you
You could use awk logic with two process substitutions feeding it the grep output:
awk 'FNR==NR{seen[$2]; next}!($2 in seen)' FS=" " <(grep "define NO_BCM" include/sdk_custom_config.h) FS="[()]" <(grep NO_BCM_ include/soc/mcm/allenum.h | grep -v 56440)
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
The idea is that the commands within <() execute and produce the output as needed. Setting FS before each input ensures the common token is parsed with the proper delimiter:
FS="[()]" captures the token as $2 in the second group, and FS=" " gives default whitespace splitting for the first group.
The core awk logic identifies non-repeating elements: FNR==NR parses the first group, storing the unique entries in $2 as a hash map. Once all those lines are parsed, !($2 in seen) runs on the second group, filtering out lines whose $2 is already present in the hash.
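The per-file FS assignments work because awk evaluates var=value operands in order, between reading files. A toy run of the same trick, assuming two scratch files:
printf '#define NO_BCM_5675_A0\n' > /tmp/defs.txt
printf '# if !defined(NO_BCM_5675_A0)\n# if !defined(NO_BCM_88660_A0)\n' > /tmp/ifs.txt
awk 'FNR==NR{seen[$2]; next} !($2 in seen)' FS=" " /tmp/defs.txt FS="[()]" /tmp/ifs.txt
This prints only the NO_BCM_88660_A0 line, since that token was never seen in the first file.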
Use comm this way:
comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)
This would give tokens unique to include/soc/mcm/allenum.h.
Output:
NO_BCM_2801PM_A0
NO_BCM_88660_A0
If you want the full lines from that file, then you can use fgrep:
fgrep -f <(comm -23 <(grep NO_BCM_ include/soc/mcm/allenum.h | cut -f2 -d'(' | cut -f1 -d')' | sort) <(grep "define NO_BCM" include/sdk_custom_config.h | cut -f2 -d' ' | sort)) include/soc/mcm/allenum.h
Output:
# if !defined(NO_BCM_88660_A0)
# if !defined(NO_BCM_2801PM_A0)
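Note: fgrep survives on modern systems only as a deprecated alias for grep -F, so the same command works with grep -F -f in its place.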
About comm:
NAME
comm - compare two sorted files line by line
SYNOPSIS
comm [OPTION]... FILE1 FILE2
DESCRIPTION
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to
FILE2, and column three contains lines common to both files.
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
It's hard to say without surrounding context from your sample input files and without expected output, but it sounds like this is all you need:
awk '!/define.*NO_BCM_/{next} NR==FNR{defined[$2];next} !($2 in defined)' include/sdk_custom_config.h FS='[()]' include/soc/mcm/allenum.h
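For what it's worth, this folds the same idea into one program: the leading !/define.*NO_BCM_/{next} skips irrelevant lines in both files, NR==FNR seeds the defined array from the #define file using default whitespace splitting, and the per-file FS='[()]' assignment makes $2 the token inside the parentheses for allenum.h.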
