Grep approach to remove all lines in file that match any line in other file? - bash

I have a file of camera information where each line has a unique ID of the format
{"_id":{"$oid":"5b0cfa5845bb0c0004277e13"},"geometry":{"coordinates":[139.751,35.685]},"addEditBy":["dd53cbd9c5306b1baa103335c4b3e91d8b73386ba29124ea2b1d47a619c8c066877843cd8a7745ce31021a8d1548cf2a"],"legacy_cameraID":1,"type":"ip","source":"google","country":"JP","city":"Tokyo","is_active_image":false,"is_active_video":false,"utc_offset":32400,"timezone_id":"Japan Standard Time","timezone_name":"Japan Standard Time","reference_url":"101.110.193.152/","retrieval":{"ip":"101.110.193.152","port":"80","video_path":"/"},"__v":0}
I also have a list of camera IDs that I want to remove from the original file in the format:
5b182800751c3b00044514a9
5b1976b473569e00045dba59
5b197b1273569e00045ddf0f
5b1970cc73569e00045d94fc
How can I use grep or some other command line utility to remove all lines in the input file that have an ID listed in the second file?

Let's say that you have a file called ids.txt that has all of the camera IDs that need to be excluded from your data file, which we'll call data.json. We can use the -f option of grep (match patterns from a file) and the -v option (only output non-matching lines) as follows:
grep -f ids.txt -v data.json
grep will only output lines of data.json that do not match any lines in ids.txt.
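If you also want to rule out any regex interpretation of the IDs and write the cleaned data to a new file, you can add -F (fixed-string matching); a minimal sketch with the same file names (data_filtered.json is just a made-up output name):
grep -vFf ids.txt data.json > data_filtered.json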

You should use a JSON-aware tool. Here is a GNU awk script that uses the json extension:
$ gawk '                                      # GNU awk
@load "json"                                  # load the json extension
NR==FNR {                                     # read oids to a hash
    oid[$0]
    next
}
{                                             # process json
    lines=lines $0                            # support multiline json form
    if(json_fromJSON(lines,data)!=0) {        # once the json is complete
        if(!(data["_id"]["$oid"] in oid))     # test if oid is in the exclude list
            print                             # output if not
        lines=""                              # rinse for repeat
    }
}' oids json

A simple thing you can do is extract the IDs from the camera info and check whether they are listed in the second file.
For example:
#!/bin/bash
exec 3<info.txt
while IFS= read -r line <&3; do
    id="$(printf '%s' "${line}" | jq '._id."$oid"' | sed -e 's/"//g')"
    if ! grep -e "${id}" list.txt >/dev/null; then
        printf '%s\n' "${line}"
    fi
done >clean.txt
exec 3>&-
Where:
info.txt is the file with camera information
list.txt is the list of ids you do not want
Note that this is not the only way to achieve it; I used a simple loop just as a proof of concept.
You can also achieve it using jq directly, for example:
#!/bin/bash
for id in $(jq '._id."$oid"' info.txt | sed -e 's/"//g'); do
    if ! grep -e "${id}" list.txt >/dev/null; then
        grep -e "${id}" info.txt
    fi
done >clean.txt
Note that in this second example the second grep is needed because you never take the whole line of the info.txt file, only the ID.
Also, be aware that if you have an alias like alias grep='grep --color=always' it could break your output.
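A jq-only variant is also possible; this is a sketch, assuming jq 1.6 or newer for --rawfile and the same info.txt/list.txt names as above:
jq -c --rawfile ids list.txt \
   'select(._id."$oid" as $o | ($ids | split("\n") | index($o) | not))' info.txt > clean.txt
Here $ids holds the raw contents of list.txt, which is split into lines and checked for the current object's oid.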

Assuming your json file is always that regular:
awk -F'"' 'NR==FNR{ids[$1]; next} !($6 in ids)' ids json

Related

Concatenate files based on numeric sort of name substring in awk w/o header

I am interested in concatenating many files together based on their numeric order and also removing the first line of each.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When I run the two commands awk and ls -v1 separately, each does its job properly, but when I put them together, it doesn't work. Please help, thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile
The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
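As a quick illustration of how FNR > 1 skips the first line of every file (a toy example with made-up contents):
$ printf 'header1\na\nb\n' > chr1_smallfiles
$ printf 'header2\nc\n' > chr2_smallfiles
$ awk 'FNR > 1' chr1_smallfiles chr2_smallfiles
a
b
c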
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
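To see what the sort stage produces, you can replace the null bytes with newlines for inspection (a quick check with three example names):
$ printf '%s\0' chr10_smallfiles chr2_smallfiles chr9_smallfiles | sort -z -n -k1.4 | tr '\0' '\n'
chr2_smallfiles
chr9_smallfiles
chr10_smallfiles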
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do   # 99 is the maximum file number
    [ -f "$file" ] || continue          # skip missing files
    awk 'FNR > 1' "$file"
done > bigfile
You can also use tail to concatenate all the files without the header:
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in the natural sort order described in your question, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy.) xargs -d $'\n' sets the delimiter to a newline (\n), in case a filename contains whitespace or quote characters.
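If you would rather not parse ls at all, GNU sort's version sort (-V) gives the same natural ordering in a null-safe pipeline; a sketch:
printf '%s\0' chr*_smallfiles | sort -zV | xargs -0 tail -q -n +2 > bigfile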
Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
#   files=( [1]=chr1_smallfiles.txt
#           [10]=chr10_smallfiles.txt
#           [9]=chr9_smallfiles.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
    files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
    awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[@]}" | sort -n -z) >bigfile
You can do it with a for loop like the one below, which works for me:
for file in chr*_smallfiles
do
    tail -n +2 "$file" >> bigfile
done
How does it work? The for loop picks up all files in the current directory that match the wildcard pattern chr*_smallfiles and assigns each file name to the variable file; tail -n +2 "$file" then outputs every line of that file except the first and appends them to bigfile. So finally all the files are merged (except the first line of each) into one file, bigfile.
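One caveat: plain glob expansion orders the names lexically, not numerically, so chr10_smallfiles is processed before chr2_smallfiles. You can check the order the loop will use with:
printf '%s\n' chr*_smallfiles   # lexical order: chr1, chr10, chr11, ..., chr2, ...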
Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
    sed -n '2,$p' "$file" >> bigfile
done
Hope it helps!
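With GNU sed the loop can be avoided entirely by treating each input file separately with -s (a sketch; the glob-order caveat above still applies):
sed -s -n '2,$p' chr*_smallfiles > bigfile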

How to copy specific columns from one csv file to another csv file?

File1.csv:
File2.csv:
I want to replace the contents of configSku, selectedSku and config_id in File1.csv with the contents of configSku, selectedSku and config_id from File2.csv. The end result should look like this:
Here are the links to download the files so you can try it yourself:
File1.csv: https://www.dropbox.com/s/2o12qjzqlcgotxr/file1.csv?dl=0
File2.csv: https://www.dropbox.com/s/331lpqlvaaoljil/file2.csv?dl=0
Here's what I have tried, but it still fails:
#!/bin/bash
INPUT=/tmp/file2.csv
OLDIFS=$IFS
IFS=,
[ ! -f $INPUT ] && { echo "$INPUT file not found"; exit 99; }
echo "no,my_account,form_token,fingerprint,configSku,selectedSku,config_id,address1,item_title" > /tmp/temp.csv
while read item_title configSku selectedSku config_id
do
    cat /tmp/file1.csv |
        awk -F ',' -v item_title="$item_title" \
            -v configSku="$configSku" \
            -v selectedSku="$selectedSku" \
            -v config_id="$config_id" \
            -v OFS=',' 'NR>1{$5=configSku; $6=selectedSku; $7=config_id; $9=item_title; print}' >> /tmp/temp.csv
done < <(tail -n +2 "$INPUT")
IFS=$OLDIFS
How do I do this?
If I understood the question correctly, how about using:
paste -d, file1.csv file2.csv | awk -F, -v OFS=',' '{print $1,$2,$3,$4,$11,$12,$13,$8,$10}'
This is not nearly as robust as the other answer, and assumes that file1.csv and file2.csv have the same number of lines and that each line in one file corresponds to the same line in the other file. The output would look like this:
no,my_account,form_token,fingerprint,configSku,selectedSku,config_id,address1,item_title
1,account1,asdf234safd,sd4d5s6sa,NEWconfigSku1,NEWselectedSku1,NEWconfig_id1,myaddr1,Samsung Handsfree
2,account2,asdf234safd,sd4d5s6sa,NEWconfigSku2,NEWselectedSku2,NEWconfig_id2,myaddr2,Xiaomi Mi headset
3,account3,asdf234safd,sd4d5s6sa,NEWconfigSku3,NEWselectedSku3,NEWconfig_id3,myaddr3,Ear Headphones with Mic
4,account4,asdf234safd,sd4d5s6sa,NEWconfigSku4,NEWselectedSku4,NEWconfig_id4,myaddr4,Handsfree/Headset
The first part uses paste to put the files side by side, separated by a comma, hence the -d option. You then end up with a combined file with 13 columns. The awk part first says that the input and output field separators should be a comma (-F, and -v OFS=',', respectively) and then prints the desired columns (columns 1-4 from the first file, then columns 2-4 of the second file, which now correspond to columns 11-13 in the merged file).
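A variant that avoids paste and patches file1.csv with awk alone works under the same assumptions (both files have a header row and their data lines are in the same order); the column positions follow the files described in the question, so treat this as a sketch:
awk -F, -v OFS=',' '
    NR==FNR { title[FNR]=$1; sku[FNR]=$2; sel[FNR]=$3; id[FNR]=$4; next }   # slurp file2
    FNR==1  { print; next }                                                 # keep the header of file1
    { $5=sku[FNR]; $6=sel[FNR]; $7=id[FNR]; $9=title[FNR]; print }          # patch the four columns
' file2.csv file1.csv > merged.csv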
The main issue in your original script is that you are reading one file (/tmp/file2.csv) one line at a time, and for each line, you parse and print the whole other file (/tmp/file1.csv).
Here is an example of how to merge two csv files in bash:
#!/bin/bash
# Open both files in "reading mode"
exec 3<"$1"
exec 4<"$2"
# Read(/discard) the header line in both csv files
read -r -u 3
read -r -u 4
# Print the new header line
printf "your,own,header,line\n"
# Read both files one line at a time and print the merged result
while true; do
    IFS="," read -r -u 3 your own || break
    IFS="," read -r -u 4 header line
    printf "%s,%s,%s,%s\n" "$your" "$own" "$header" "$line"
done
exec 3<&-
exec 4<&-
Assuming you saved the script above in "merge_csv.sh", you can use it like this:
$ bash merge_csv.sh /tmp/file1.csv /tmp/file2.csv > /tmp/temp.csv
Be sure to modify the script to suit your needs (I did not use the headers you provided in your question).
If you are not familiar with the exec command, the tldp documentation and the bash hackers wiki both have an entry about it. The man page for read should document the -u option well enough. Finally, VAR="something" command arg1 arg2 (used in the script as IFS="," read -r -u 3 ...) is a common construct in shell scripting. If you are not familiar with it, I believe this answer should provide enough information on what it does.
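As a tiny illustration of that construct, the IFS assignment applies only to the read command itself (a toy example):
$ line="one,two,three"
$ IFS="," read -r a b c <<< "$line"
$ printf '%s\n' "$a" "$b" "$c"
one
two
three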
Note: if you want to do more complex processing of csv files I recommend using python and its csv package.

Bash awk append to same line

There are numerous posts about removing leading white space and appending an entry to a single existing line in a file using awk. None of my attempts work - just three examples here of the many I have tried.
Say I have a file called $log with a single line
a:b:c
and I want to add a fourth entry,
awk '{ print $4"d" }' $log | tee -a $log
the output seems to contain a newline:
a:b:c:
d
whereas, I want all on the same line;
a:b:c:d
try
BEGIN { FS = ":" } ; awk '{ print $4"d" }' $log | tee -a $log
or this, to avoid a newline:
awk 'BEGIN { ORS=":" }; { print $4"d" }' $log | tee -a $log
no change
a:b:c:
d
awk is placing a space after c: and then writing d to the next line.
EDIT: | tee -a $log appears to be necessary to write the additional string to the file.
$log contains 39 variables and was generated using awk without | tee -a
odd...
The actual command to write $40 to the single line entries
awk '{ print $40"'$imagedir'" }' $log
output
+ awk '{ print $40"/home/geoland/Asterism-DEVEL/DSO" }'
/home/geoland/.asterism/log
but this does not write to the $log file.
How should I append d to the same line without leading white space using awk? I am also looking at sed, xargs and other alternatives.
Using awk:
awk '{ print $0":d" }' file
Using sed:
sed 's/$/:d/' file
Using only bash:
while IFS= read -r line; do
    echo "$line:d"
done < file
Using sed:
$ echo a:b:c | sed 's,\(^.*$\),\1:d,'
a:b:c:d
Thanks all... This is the solution I went with. I also needed to write the entire line to a perpetual log file because the log file is overwritten at each new process instance.
I will further investigate an awk solution.
logname=$imagedir/log_$name
while IFS=: read -r line; do
    echo "$line$imagedir"
done < "$log" | tee "$logname"
This places $imagedir directly behind the last IFS ':' separator
There is probably room for refinement.
I too am not entirely sure what you're trying to do here.
Your command line, awk '{ print $4"d" }' $log | tee -a $log is problematic in a number of ways.
First, your awk script tries to print the 4th field, which is empty. Unless you say otherwise, fields are separated by whitespace, and the string a:b:c has no whitespace. So awk prints just "d". And tee -a appends to your existing logfile, so what you're seeing is the original data, along with the d printed by awk. That's totally expected.
Second, you have tee appending to the same file that awk is in the process of reading. This won't create an endless loop, as awk should stop reading the input file after whatever was the last byte when the file was opened, but it does mean you may have repeated data there.
Your other attempts, aside from some syntactical errors, all suffer from the same assumption that $4 means something that it does not.
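You can confirm that the default whitespace splitting leaves the whole string in $1 and $4 empty (a quick check):
$ echo "a:b:c" | awk '{ print NF, $1, "[" $4 "]" }'
1 a:b:c []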
The following awk snippet sets the input and output field separators to :, then sets the 4th field to "d", then prints the line.
$ echo "a:b:c" | awk 'BEGIN{FS=OFS=":"} {$4="d"} 1'
a:b:c:d
Is that what you want?
If you really do need to append this data to an existing log file, you can do so with tee -a or simple >> redirection. Just bear in mind that awk will only see the content of the file as of the time it was run, and by appending, you are not replacing lines.
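If the goal is actually to replace the line rather than append a new one, a common pattern is to write to a temporary file and move it back (a sketch; GNU awk 4.1+ also offers -i inplace):
awk 'BEGIN{FS=OFS=":"} {$4="d"} 1' "$log" > "$log.tmp" && mv "$log.tmp" "$log"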
One other thing. If you are actually hoping to use the content of the shell variable $imagedir inside awk, you should pass the variable in rather than exiting your quotes. For example:
$ echo "a:b:c" | awk -v d="foo/bar" 'BEGIN{FS=OFS=":"} {$4=d} 1'
a:b:c:foo/bar
sed "s|$|$imagedir|" file | tee newfile
This does the trick: it reads 'file' and writes its contents, with the substitution applied, to 'newfile', so the image directory can be read by a secondary standalone process.
Because the variable is a directory path containing several / characters, those would need to be escaped so they are not interpreted as sed delimiters, and I had difficulty doing that with a variable. A neater option is to use an alternative delimiter such as | (not to be confused with the pipe to tee that follows).
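To illustrate the delimiter point (a sketch, using the directory value visible in the question's trace output):
imagedir=/home/geoland/Asterism-DEVEL/DSO
# with the default / delimiter, every slash in the path would have to be escaped first:
sed "s/$/${imagedir//\//\\/}/" file
# with | as the delimiter, no escaping is needed:
sed "s|$|$imagedir|" file > newfile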

How to extract strings from a text in shell

I have a file name
"PHOTOS_TIMESTAMP_5373382"
I want to extract "PHOTOS_5373382" from this filename and add "ABC", i.e. I finally want it to look like
"abc_PHOTOS_5373382" in a shell script.
echo "PHOTOS_TIMESTAMP_5373382" | awk -F"_" '{print "ABC_"$1"_"$3}'
echo provides the input for the awk command.
The awk command tokenizes the input on the character '_' using the -F option.
Individual tokens (starting from 1) can be accessed as $n, where n is the token number.
You can run the following sequence of commands directly in your shell (preferably bash), or as a complete script that takes a single argument, the file to be renamed:
#!/bin/bash
myFile="$1" # Input argument (file-name with extension)
filename=$(basename "$myFile") # Getting the absolute file-path
extension="${filename##*.}" # Extracting the file-name part without extension
filename="${filename%.*}" # Extracting the extension part
IFS="_" read -r string1 string2 string3 <<<"$filename" # Extracting the sub-string needed from the original file-name with '_' de-limiter
mv -v "$myFile" ABC_"$string1"_"$string3"."$extension" # Renaming the actual file
On running the script as
$ ./script.sh PHOTOS_TIMESTAMP_5373382.jpg
`PHOTOS_TIMESTAMP_5373382.jpg' -> `ABC_PHOTOS_5373382.jpg'
Although I like awk, here is a native shell solution:
k="PHOTOS_TIMESTAMP_5373382"
IFS="_" read -a arr <<< "$k"
echo abc_${arr[0]}_${arr[2]}
Sed solution
echo "abc_$k" | sed -e 's/TIMESTAMP_//g'
abc_PHOTOS_5373382
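The same substitution can also be done with bash parameter expansion alone, without any external command (a sketch):
k="PHOTOS_TIMESTAMP_5373382"
echo "abc_${k/TIMESTAMP_/}"
# abc_PHOTOS_5373382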

Unix shell scripting, need assign the text files values to the sed command

I was trying to add the lines from the following text file to the sed command.
observered_list.txt
Uncaught SlingException
cannot render resource
IncludeTag Error
Recursive invocation
Reference component error
I need it to be coded like the following:
sed '/Uncaught SlingException\|cannot render resource\|IncludeTag Error\|Recursive invocation\|Reference component error/ d'
help me to do this.
I would suggest you create a sed script and delete each pattern consecutively:
while read -r pattern; do
    printf "/%s/ d;\n" "$pattern"
done < observered_list.txt >> remove_patterns.sed
# now invoke sed on the file you want to modify
sed -f remove_patterns.sed file_to_clean
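For the observered_list.txt shown in the question, the generated remove_patterns.sed would contain:
/Uncaught SlingException/ d;
/cannot render resource/ d;
/IncludeTag Error/ d;
/Recursive invocation/ d;
/Reference component error/ d;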
Alternatively you could construct the sed command like this:
pattern=
while read -r line; do
    pattern=$pattern'\|'$line
done < observered_list.txt
# strip off the first and last \|
pattern=${pattern#\\\|}
pattern=${pattern%\\\|}
printf "sed '/%s/ d'\n" "$pattern"
# you still need to invoke the command, it's just printed
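For the observered_list.txt above, the printed command is exactly the one from the question:
sed '/Uncaught SlingException\|cannot render resource\|IncludeTag Error\|Recursive invocation\|Reference component error/ d'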
You can use grep for that:
grep -vFf /file/with/patterns.txt /file/to/process.txt
Explanation:
-v excludes from the output those lines of process.txt which match any of the patterns
-F treats patterns in patterns.txt as fixed strings instead of regexes (looks like this is desired here)
-f reads patterns from patterns.txt
Check man grep for further information.
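With the names used earlier in this thread, the whole task reduces to one line (a sketch; file_to_clean stands in for the file you want to filter):
grep -vFf observered_list.txt file_to_clean > cleaned_file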
