Remove duplicates from the same line in a file - shell
How do I remove the duplicates below from the same line in a file? I need the duplicates removed, including the semicolon.
For example, from the file contents below I need only "dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com", and similarly for the other lines of the file.
dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com
dg02.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com
dg03.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com
dg04.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com
dg05.server.wmq.host=jms1005-01-ri5.ri5.dc2.responsys.com;jms1005-02-ri5.ri5.dc2.responsys.com
dg06.server.wmq.host=jms1006-01-ri5.ri5.dc2.responsys.com;jms1006-02-ri5.ri5.dc2.responsys.com
dg07.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com
dg08.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com
dg09.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com
dg10.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com
dg11.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com
dg12.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com
dg13.server.wmq.host=jms1013-01-ri5.ri5.dc2.responsys.com;jms1013-02-ri5.ri5.dc2.responsys.com
dg14.server.wmq.host=jms1014-01-ri5.ri5.dc2.responsys.com;jms1014-02-ri5.ri5.dc2.responsys.com
dg15.server.wmq.host=jms1015-01-ri5.ri5.dc2.responsys.com;jms1015-02-ri5.ri5.dc2.responsys.com
dg16.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com
dg17.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com
dg18.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com
dg19.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com
dg20.server.wmq.host=jms1005-01-ri5.ri5.dc2.responsys.com;jms1005-02-ri5.ri5.dc2.responsys.com
dg21.server.wmq.host=jms1006-01-ri5.ri5.dc2.responsys.com;jms1006-02-ri5.ri5.dc2.responsys.com
dg22.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com
dg23.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com
dg24.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com
dg25.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com
dg26.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com
dg27.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com
dg28.server.wmq.host=jms1013-01-ri5.ri5.dc2.responsys.com;jms1013-02-ri5.ri5.dc2.responsys.com
dg29.server.wmq.host=jms1014-01-ri5.ri5.dc2.responsys.com;jms1014-02-ri5.ri5.dc2.responsys.com
dg30.server.wmq.host=jms1015-01-ri5.ri5.dc2.responsys.com;jms1015-02-ri5.ri5.dc2.responsys.com
dg31.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com
dg32.server.wmq.host=jms1002-01-ri5.ri5.dc2.responsys.com;jms1002-02-ri5.ri5.dc2.responsys.com
dg33.server.wmq.host=jms1003-01-ri5.ri5.dc2.responsys.com;jms1003-02-ri5.ri5.dc2.responsys.com
dg34.server.wmq.host=jms1004-01-ri5.ri5.dc2.responsys.com;jms1004-02-ri5.ri5.dc2.responsys.com
dg35.server.wmq.host=jms1009-01-ri5.ri5.dc2.responsys.com;jms1009-02-ri5.ri5.dc2.responsys.com
dg36.server.wmq.host=jms1010-01-ri5.ri5.dc2.responsys.com;jms1010-02-ri5.ri5.dc2.responsys.com
dg37.server.wmq.host=jms1011-01-ri5.ri5.dc2.responsys.com;jms1011-02-ri5.ri5.dc2.responsys.com
dg38.server.wmq.host=jms1012-01-ri5.ri5.dc2.responsys.com;jms1012-02-ri5.ri5.dc2.responsys.com
dg39.server.wmq.host=jms1007-01-ri5.ri5.dc2.responsys.com;jms1007-02-ri5.ri5.dc2.responsys.com
dg40.server.wmq.host=jms1008-01-ri5.ri5.dc2.responsys.com;jms1008-02-ri5.ri5.dc2.responsys.com
Assuming dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com;jms1001-02-ri5.ri5.dc2.responsys.com is a line in your input file and you're only interested in the dg01.server.wmq.host=jms1001-01-ri5.ri5.dc2.responsys.com part (up to, but not including, the semicolon), you can obtain the desired output by running:
awk -F ';' '{print $1}' inputfile
Another way to obtain the same output, as pointed out by @Shawn, would be:
cut -d ';' -f1 inputfile
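If you need to trim the file in place rather than print the cleaned-up output, a minimal sketch using sed (assuming GNU sed for the -i option; make a backup first) is to delete everything from the first semicolon to the end of each line:
sed -i 's/;.*//' inputfile
This leaves only the part before the first semicolon on every line, which matches the expected output above.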
Related
How to get values in a line while looping line by line in a file (shell script)
I have a file which looks like this (file.txt):
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
I want to loop through this line by line and extract the key values, so the result should be:
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
So I wrote code like this to loop line by line and extract the value between {"key":" and ","rule":, because the key value sits between those two patterns:
while read p; do
  echo $p | sed -n "/{"key":"/,/","rule":,/p"
done < file.txt
But this is not working. Can someone help me figure this out? Thanks in advance.
Your sample input is almost valid JSON. You could tweak it to make it valid and then extract the values with jq, with something like:
sed -e 's/squid/"squid/' -e 's/$/"}/' file.txt | jq -r .key
Or, if your actual input really is valid JSON, then just use jq:
jq -r .key file.txt
If the "random text" may include double quotes, making it difficult to massage the input into valid JSON, perhaps you want something like:
awk '{print $4}' FS='"' file.txt
or
sed -n '/{"key":"\([^"]*\).*/s//\1/p' file.txt
or
while IFS=\" read open_brace key colon val _; do echo "$val"; done < file.txt
For the shown data, you can try this awk:
awk -F '"[:,]"' '{print $2}' file
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
With the given example you can simply use:
cut -d'"' -f4 file.txt
Assumptions:
- there may be other lines in the file, so we need to focus on just the lines with "key" and "rule"
- the only text between "key" and "rule" is the desired string (e.g., squid never shows up between the two patterns of interest)
Adding some additional lines:
$ cat file.txt
{"key":"AJGUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"TJHJHJHDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"YUUUIGIDH566","rule":squid:111-some_random_text_here
ignore this line}
{"key":"HJHHIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
{"key":"ATYUGUIDH556","rule":squid:111-some_random_text_here
ignore this line}
{"key":"QfgUIGIDH568","rule":squid:111-some_random_text_here
ignore this line}
One sed idea:
$ sed -nE 's/^(.*"key":")([^"]*)(","rule".*)$/\2/p' file.txt
AJGUIGIDH568
TJHJHJHDH568
YUUUIGIDH566
HJHHIGIDH568
ATYUGUIDH556
QfgUIGIDH568
Where:
-E - enable extended regex support (and capture groups without the need to escape sequences)
-n - suppress printing of pattern space
^(.*"key":") - [1st capture group] everything from start of line up to and including "key":"
([^"]*) - [2nd capture group] everything that is not a double quote (")
(","rule".*)$ - [3rd capture group] everything from ","rule" to end of line
\2/p - replace the line with the contents of the 2nd capture group and print
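If GNU grep built with PCRE support is available, another option for the same data is a lookbehind match; this is only a sketch and assumes the key values never contain double quotes:
grep -oP '(?<="key":")[^"]*' file.txt
-o prints only the matched part, and the (?<="key":") lookbehind anchors the match just after "key":" without including it in the output.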
replace string with exact match in bash script
I have a file with a lot of repeated content, as given below; these are the only unique values:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="
I want to replace the empty field with "Null" and need output like this:
CHECKSUM="Y"
CHECKSUM="N"
CHECKSUM="U"
CHECKSUM="Null"
What I can think of is:
# First find the matching content
cat file.txt | egrep 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_contain.txt
# Find the content where the given strings are not there
cat file.txt | egrep -v 'CHECKSUM="Y"|CHECKSUM="N"|CHECKSUM="U"' > file_donot_contain.txt
# Replace the string in the content-not-found file
sed -i 's/CHECKSUM="/CHECKSUM="Null"/g' file_donot_contain.txt
# Merge the files
cat file_contain.txt file_donot_contain.txt > output.txt
But I find this is not an efficient way of doing it. Any other suggestions?
To achieve this you need to mark that this is the end of the line, not just part of it, using $ (and optionally ^ to mark the start of the line too):
sed -i 's/^CHECKSUM="$/CHECKSUM="Null"/' file.txt
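For comparison, an equivalent single pass with awk; this is only a sketch and assumes the lines to fix are exactly CHECKSUM=" with nothing after the opening quote:
awk '$0 == "CHECKSUM=\"" { $0 = "CHECKSUM=\"Null\"" } 1' file.txt > output.txt
The trailing 1 prints every record, so matching lines are rewritten and all other lines pass through unchanged.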
Unix bash - using cut to regex lines in a file, match regex result with another similar line
I have a text file, file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I use the cut command to pull out the lines I am interested in first. Each entry I am interested in is listed twice in the text file: once in a "definition" section and once in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find its corresponding "value" section entry. The first entry starts with 'gl_', while the second entry looks like '"gl_', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
do
if [[ $line == gl_* ]] ; then
(param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
do
if [[ $glline == '"'$param* ]] ; then
val=$(cut -d'\' -f 3 $glline) | "$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This throws some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
some\junk
"gl_one\1\Value1
some\junk
"gl_two\1\Value2
"gl_three\1\Value3
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', and stores that value (i.e. gl_one) as the variable 'param'. It then starts the nested while loop that looks for the line that starts with a '"' in front of the gl_ and is equivalent to the 'param' value. In other words, the script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter and save them together in a .csv file with their corresponding "gl_ values. The wanted output, stored in variables, would be something like this:
first while loop: $param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop: $val = Value1
It then stores these variables in file.csv, with semicolon separators.
Currently, I get an error for the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this, I am looking for general ideas and comments on the script; i.e., I am not entirely sure I am matching the quotation-mark parameters "gl_ correctly, or whether the semicolons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slowly due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file? Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_', then sed to remove the leading '"' from the lines that contain one [I have assumed there are no further '"' in the line]. The lines are sorted, the second sed joins each pair of lines by replacing the newline with a backslash, and awk then prints the required columns according to your requirements. The output is routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
  /^gl_/{             # if definition
    N;                # append next line to buffer
    s/\n"gl_[^\\]*//; # if value, strip first column
    t;                # and start next loop
  }
  D;                  # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }' \
  >>/filepath/file.csv
sort orders the lines so each gl_... appears immediately next to its "gl_... (LANG=C fixes the collation order, LC_COLLATE); it assumes the definition appears before the value.
sed helps ensure matching definition and value (it may still fail on a duplicate/missing value) and tidies the data for awk.
awk pulls out the relevant fields.
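Another possibility is a single-pass awk sketch that avoids sorting altogether. It assumes every gl_ definition line appears somewhere before its matching "gl_ value line and uses the field positions shown in the question (file.txt, /filepath/file.csv and $project are taken from the example above); it also prints the fields unquoted, unlike the loop in the question:
awk -F'\\' -v p="$project" -v OFS=';' '
  /^gl_/  { def[$1] = $2; type[$1] = $4; prompt[$1] = $8; next }  # remember each definition
  /^"gl_/ { param = substr($1, 2)                                 # drop the leading quote
            if (param in def)
              print p, param, $3, def[param], type[param], prompt[param]
          }
' file.txt >> /filepath/file.csv
Because the whole file is read once and lookups are done in an associative array, this avoids the quadratic cost of the nested while loop.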
Output matching lines in linux
I want to match the numbers in the first file against the 2nd column of the second file and write the matching lines to a separate output file. Kindly let me know what is wrong with the code.
I have a list of numbers in a file IDS.txt:
10028615
1003
10096344
10100
10107393
10113978
10163178
118747520
I have a second file called src1src22.txt:
From src:'1' To src:'22'
CHEMBL3549542 118747520
CHEMBL548732 44526300
CHEMBL1189709 11740251
CHEMBL405440 44297517
CHEMBL310280 10335685
Expected newoutput.txt:
CHEMBL3549542 118747520
I have written this code:
while read line; do cat src1src22.txt | grep -i -w "$line" >> newoutput.txt done<IDS.txt
Your command line works, except you're missing a semicolon:
while read line; do grep -i -w "$line" src1src22.txt; done < IDS.txt >> newoutput.txt
I have found a more efficient way to perform the task. Instead of a loop, try this: the -f option reads the patterns from the file named after it and searches for them in the other file. This avoids building overly long pattern arguments for grep, and it skips the looping, which slows the process down.
grep -iw -f IDS.txt src1src22.txt >> newoutput.txt
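Since the IDs are plain numbers rather than regular expressions, it may also be worth adding -F so grep treats them as fixed strings; a hedged variant of the same idea:
grep -Fwf IDS.txt src1src22.txt > newoutput.txt
-F matches literally, -w still requires whole-word matches, and -f reads the list of strings from IDS.txt.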
Try this:
awk 'NR==FNR{a[$2]=$1;next} $1 in a{print a[$1],$0}' f2 f1
CHEMBL3549542 118747520
where f2 is src1src22.txt and f1 is IDS.txt.
Get lines by a unique portion of the line, and display only the first occurrence of that unique portion
I'm trying to write a script that looks at a part of a line, does a sort -u or something to find unique occurrences, and then displays the output sorted by the ORIGINAL ordering of the lines. In other words, only the FIRST occurrence of that part of the line would show up. I managed to do it using cut, but my output just displays the cut portion of the data. How could I do it so that it gets the entire line? Here's what I've got so far:
cut -d, -f6 infile.txt | cut -c4-11 | grep -n . | sort -t: -k2,2 -u | sort -t: -k1n,1 | cut -d: -f2-
I know the data doesn't have an extra : or a , in a place that would break this script. But this only outputs the data that was unique. How can I get the entire line? I would prefer to stay away from perl, but awk is okay (though I don't know it very well).
Sample: If the input file is this (note, the ABCDEFGH is not real, I just put it there to illustrate what I mean):
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
C....,....,...........,.....,....,...20130718......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
F....,....,...........,.....,....,...20130714......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
H....,....,...........,.....,....,...20130718......,.........,...........,......
My program outputs:
20130718
20130714
20130719
20130713
20130630
I want to see:
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
Yes, awk is your best bet. Here's a mysterious example:
awk -F, '!seen[substr($6,4,8)]++' infile.txt
Explanation:
options:
-F, - set the field separator to ,
condition:
substr($6,4,8) - up to 8 characters starting at the fourth character of the sixth field
seen[...]++ - seen is an associative array (dictionary). Increment the value associated with ..., and return the old value
!seen[...]++ - if there was no old value, perform the action
action:
There is no action, only a condition, so the default action is performed if the test succeeds. The default action is to print the line. So the line will be printed if the relevant characters of the sixth field haven't yet been seen.
Test:
$ awk -F, '!seen[substr($6,4,8)]++' <<EOF
> A....,....,...........,.....,....,...20130718......,.........,...........,......
> B....,....,...........,.....,....,...20130714......,.........,...........,......
> C....,....,...........,.....,....,...20130718......,.........,...........,......
> D....,....,...........,.....,....,...20130719......,.........,...........,......
> E....,....,...........,.....,....,...20130713......,.........,...........,......
> F....,....,...........,.....,....,...20130714......,.........,...........,......
> G....,....,...........,.....,....,...20130630......,.........,...........,......
> H....,....,...........,.....,....,...20130718......,.........,...........,......
> EOF
A....,....,...........,.....,....,...20130718......,.........,...........,......
B....,....,...........,.....,....,...20130714......,.........,...........,......
D....,....,...........,.....,....,...20130719......,.........,...........,......
E....,....,...........,.....,....,...20130713......,.........,...........,......
G....,....,...........,.....,....,...20130630......,.........,...........,......
$
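As a small variation (assuming GNU tac is available), you can keep the last occurrence of each date instead of the first while still preserving the original order of the surviving lines:
tac infile.txt | awk -F, '!seen[substr($6,4,8)]++' | tac > outfile.txt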