Remove duplicates from the same line in a file - shell

How do I remove the duplicates below from each line in a file? I need the duplicate removed, including the semicolon.
For example, from the file output below I need only "dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com", and similarly for the other lines of the file.
dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com dg02.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com dg03.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com dg04.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com dg05.server.wmq.host=jms1005-01-ri5.ri5.dc2.responsys.com;jms1005-02-ri5.ri5.dc2.responsys.com dg06.server.wmq.host=jms1006-01-ri5.ri5.dc2.responsys.com;jms1006-02-ri5.ri5.dc2.responsys.com dg07.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com dg08.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com dg09.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com dg10.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com dg11.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com dg12.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com dg13.server.wmq.host=jms1013-01-ri5.ri5.dc2.responsys.com;jms1013-02-ri5.ri5.dc2.responsys.com dg14.server.wmq.host=jms1014-01-ri5.ri5.dc2.responsys.com;jms1014-02-ri5.ri5.dc2.responsys.com dg15.server.wmq.host=jms1015-01-ri5.ri5.dc2.responsys.com;jms1015-02-ri5.ri5.dc2.responsys.com dg16.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com dg17.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com dg18.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com dg19.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com dg20.server.wmq.host=jms1005-01-ri5.ri5.dc2.responsys.com;jms1005-02-ri5.ri5.dc2.responsys.com dg21.server.wmq.host=jms1006-01-ri5.ri5.dc2.responsys.com;jms1006-02-ri5.ri5.dc2.responsys.com dg22.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com dg23.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com dg24.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com dg25.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com dg26.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com dg27.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com dg28.server.wmq.host=jms1013-01-ri5.ri5.dc2.responsys.com;jms1013-02-ri5.ri5.dc2.responsys.com dg29.server.wmq.host=jms1014-01-ri5.ri5.dc2.responsys.com;jms1014-02-ri5.ri5.dc2.responsys.com dg30.server.wmq.host=jms1015-01-ri5.ri5.dc2.responsys.com;jms1015-02-ri5.ri5.dc2.responsys.com dg31.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com dg32.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com dg33.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com dg34.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com dg35.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com dg36.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com dg37.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com 
dg38.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com dg39.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com dg40.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com

Assuming dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com is a line in your input file and you are only interested in the dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com part (up to, but not including, the semicolon), you can obtain the desired output by running:
awk -F ';' '{print $1}' inputfile
Another way to obtain the same output, as pointed out by @Shawn, would be:
cut -d ';' -f1 inputfile
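If you prefer sed, deleting everything from the first semicolon to the end of each line gives the same result (a sketch that, like the commands above, assumes each dgNN entry sits on its own line):
sed 's/;.*$//' inputfile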

Related

How to get values in a line while looping line by line in a file (shell script)

I have a file which looks like this (file.txt)
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
I want to loop through this line by line and extract the key values.
So the result should look like this:
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
So I wrote code like this to loop line by line and extract the value between {"key":" and ","rule": because the key value is between these two patterns.
while read p; do
echo $p | sed -n "/{"key":"/,/","rule":,/p"
done < file.txt
But this is not working. Can someone help me figure this out? Thanks in advance.
Your sample input is almost valid json. You could tweak it to make it valid and then extract the values with jq, using something like:
sed -e 's/squid/"squid/' -e 's/$/"}/' file.txt | jq -r .key
Or, if your actual input really is valid json, then just use jq:
jq -r .key file.txt
If the "random-txt" may include double quotes, making it difficult to massage the input to make it valid json, perhaps you want something like:
awk '{print $4}' FS='"' file.txt
or
sed -n '/{"key":"\([^"]*\).*/s//\1/p' file.txt
or
while IFS=\" read open_brace key colon val _; do echo "$val"; done < file.txt
For the shown data, you can try this awk:
awk -F '"[:,]"' '{print $2}' file
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
With the given example you can simply use
cut -d'"' -f4 file.txt
Assumptions:
there may be other lines in the file so we need to focus on just the lines with "key" and "rule"
the only text between "key" and "rule" is the desired string (eg, squid never shows up between the two patterns of interest)
Adding some additional lines:
$ cat file.txt
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
ignore this line}
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
ignore this line}
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
One sed idea:
$ sed -nE 's/^(.*"key":")([^"]*)(","rule".*)$/\2/p' file.txt
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
Where:
-E - enable extended regex support (and capture groups without needing to escape the parentheses)
-n - suppress printing of pattern space
^(.*"key":") - [1st capture group] everything from start of line up to and including "key":"
([^"]*) - [2nd capture group] everything that is not a double quote (")
(","rule".*)$ - [3rd capture group] everything from ",rule" to end of line
\2/p - replace the line with the contents of the 2nd capture group and print
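A rough awk equivalent of the same idea (a sketch assuming the key itself never contains a double quote) restricts the match to lines carrying both markers and prints the fourth double-quote-delimited field:
awk -F'"' '/"key":".*","rule"/ {print $4}' file.txt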

replace string with exact match in bash script

I have a file containing many repeated lines, as given below. These are the only unique values:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="
I want to replace the empty field with "Null" and need the output to be:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="Null"
What I can think of is:
#First find the matching content
cat file.txt | egrep 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_contain.txt
# Find the content where given string are not there
cat file.txt | egrep -v 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_donot_contain.txt
# Replace the string in content not found file
sed -i 's/CHECKSUM="/CHECKSUM="Null"/g' file_donot_contain.txt
# Merge the files
cat file_contain.txt file_donot_contain.txt > output.txt
But I find this is not an efficient way of doing it. Any other suggestions?
To achieve this you need to mark the end of the line with $, not just match part of it (and optionally use ^ to mark the start of the line too):
sed -i 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt
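Since -i edits the file in place, it may be worth previewing the substitution first and only adding -i once the output looks right:
# preview the result without modifying file.txt
sed 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt
# then apply the change in place
sed -i 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt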

Unix bash - using cut to regex lines in a file, match regex result with another similar line

I have a text file, file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I first filter for the lines I am interested in and use the cut command to pull out their fields. Each entry I am interested in is listed twice in the text file: once in a "definition" section, and once in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there, find its corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', which then stores that value (ie. gl_one) as a variable 'param'.
It then starts the nested while loop that looks for the line starting with a '"' in front of the gl_ and matching the 'param' value. In other words, the script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I get an error on the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this, I am looking for general ideas and comments on the script, i.e. I am not entirely sure whether I am matching the quoted "gl_ parameters correctly, or whether the semicolons used as .csv separators are added correctly.
Edit: Overall, the script runs now, but it is extremely slow because of the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_',
then sed to remove the leading '"' from the lines that contain one (I have assumed there are no further '"' in the line).
The lines are sorted,
sed then removes the newline from each pair of lines,
and awk prints the required columns according to your requirements.
The output is routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
>>/filepath/file.csv
sort lines so gl_... appears immediately before "gl_... (LANG fixes LC_TYPE) - assumes definition appears before value
sed to help ensure matching definition and value (may still fail if duplicate/missing value), and tidy for awk
awk to pull out relevant fields
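If reading the file twice is acceptable, a single awk program can also pair definitions with values without sorting, which addresses the slow inner while loop; this is only a sketch under the same assumptions (fields are backslash-separated, definition lines start with gl_, value lines with "gl_):
awk -F'\\' -v p="$project" -v OFS=';' '
  NR==FNR { if ($1 ~ /^gl_/) def[$1] = $0; next }   # first pass: remember definition lines by parameter name
  /^"gl_/ {
    param = substr($1, 2)                           # strip the leading double quote
    if (param in def) {
      split(def[param], d, "\\")                    # re-split the stored definition line
      print p, param, $3, d[2], d[4], d[8]          # project;param;val;def;type;prompt
    }
  }
' file.txt file.txt >> /filepath/file.csv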

Output matching lines in linux

I want to match the numbers in the first file against the second column of the second file and write the matching lines to a separate output file. Kindly let me know what is wrong with the code.
I have a list of numbers in a file IDS.txt
10028615
1003
10096344
10100
10107393
10113978
10163178
118747520
I have a second File called src1src22.txt
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL1189709 11740251
CHEMBL405440 44297517
CHEMBL310280 10335685
expected newoutput.txt
CHEMBL3549542 118747520
I have written this code
while read line; do cat src1src22.txt | grep -i -w "$line" >> newoutput.txt done<IDS.txt
Your command line works - except you're missing a semicolon:
while read line; do grep -i -w "$line" src1src22.txt; done < IDS.txt >> newoutput.txt
I have found a more efficient way to perform the task. Instead of a loop, use grep's -f option, which reads the patterns from the file named after it and searches the other file in a single pass. This reduces the chance of pattern-length problems that can occur with grep, and avoids the looping, which slows the process down.
grep -iw -f IDS.txt src1src22.txt >> newoutput.txt
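Since the IDs are plain numbers rather than regular expressions, adding -F (fixed-string matching) is a small, arguably safer variation on the same idea:
# treat each ID as a literal string (-F), match whole words (-w), read patterns from IDS.txt (-f)
grep -Fwf IDS.txt src1src22.txt >> newoutput.txt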
Try this -
awk 'NR==FNR{a[$2]=$1;next} $1 in a{print a[$1],$0}' f2 f1
CHEMBL3549542 118747520
Where f2 is src1src22.txt and f1 is IDS.txt.

Get lines by a unique portion of the line, and display only the first occurrence of that unique portion

I'm trying to write a script that looks at a part of a line, does a sort -u or something to look for unique occurrences, and then displays the output, sorted by the ORIGINAL ordering of the lines. In other words, only the FIRST occurrence of that part of the line would show up.
I managed to do it using cut, but my output just displays the cut portion of the data. How could I do it so that it gets the entire line?
Here's what I've got so far:
cut -d, -f6 infile.txt | cut -c4-11 | grep -n . | sort -t: -k2,2 -u | sort -t: -k1n,1 | cut -d: -f2-
I know the data doesn't have an extra : or a , in a place that would break this script. But this only outputs the data that was unique. How can I get the entire line? I would prefer to stay away from perl, but awk is okay (though I don't know it very well).
Sample:
If the input file is this (note, the ABCDEFGH is not real, I just put it there to illustrate what I mean):
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
C....,....,...........,.....,....,...20130718......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
F....,....,...........,.....,....,...20130714......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
H....,....,...........,.....,....,...20130718......,.........,...........,......
My program outputs:
20130718
20130714
20130719
20130713
20130630
I want to see:
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
Yes, awk is your best bet. Here's a mysterious example:
awk -F, '!seen[substr($6,4,8)]++' infile.txt
Explanation:
options:
-F, set the field separator to ,
condition:
substr($6,4,8) up to 8 characters starting at the fourth character of the sixth field
seen[...]++ seen is an associative array (dictionary). Increment the value associated with ..., and return the old value
!seen[...]++ if there was no old value, perform the action
action:
There is no action, only a condition, so the default action is performed if the test succeeds. The default action is to print the line. So the line will be printed if the relevant characters of the sixth field haven't yet been seen.
Test:
$ awk -F, '!seen[substr($6,4,8)]++' <<EOF
> A....,....,...........,.....,....,...20130718......,.........,...........,......
> B....,....,...........,.....,....,...20130714......,.........,...........,......
> C....,....,...........,.....,....,...20130718......,.........,...........,......
> D....,....,...........,.....,....,...20130719......,.........,...........,......
> E....,....,...........,.....,....,...20130713......,.........,...........,......
> F....,....,...........,.....,....,...20130714......,.........,...........,......
> G....,....,...........,.....,....,...20130630......,.........,...........,......
> H....,....,...........,.....,....,...20130718......,.........,...........,......
> EOF
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
$
