grep listing false duplicates - bash

I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare this list of entries against another subset file called "disps.dat", formatted the same way, to find duplicates:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the samples shown above don't contain any duplicates, but I do expect, and have confirmed, at least 10-12k duplicates to show up in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results were fine, so I'm a little confused about why it is failing on the larger execution.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now, I am feeding the data through a 'for' loop and echoing each line, which is executing very slowly, but at least it will help me rule out the regex possibility.

The -x or -w option is needed to do an exact match.
-x matches the entire line, and -w matches whole words only (a match cannot run on into further word characters), which handles the trailing-digit problem in my case.
The issue is that a record in the first file such as:
"AnalogPoint,1"
would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
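For reference, a minimal sketch of the corrected command (file names as in the question; -F is added on the assumption that the patterns are plain strings rather than regexes):
# -F treats each pattern as a fixed string, -w requires whole-word matches,
# so "AnalogPoint,1" can no longer match "AnalogPoint,10".
grep -wFf pilot.dat disps.dat > duplicate.dat
# Or require the entire line to match with -x, which is stricter still:
grep -xFf pilot.dat disps.dat > duplicate.dat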

Related

Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, second time A00000_00_A_002 etc. The output needs to be written back to the same file. Each file only contains 1 code, but it appears multiple times.
After some digging I have found:
perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml
but the issue is the counter does not reset in each file.
That is, the output of the first file is say Q00390_01_A_1 to Q00390_01_A_7, while the second file is Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_2 in the second.
Does anyone have any idea on how to edit the above code to make it do that? I'm a total newbie so ideally an edit to what I have would be brilliant. Thanks
cd /users/documents/
for f in *.xml; do
    # A fresh perl process runs for each file, so the counter $A resets per file.
    perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge' "$f"
done
This matches the string facs= followed by any character, then "Q" or "M" followed by either four or five digits, then an underscore, two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A zero-padded to four digits. Because the perl command is run once per file inside the loop, the counter $A starts again from zero for each file, which is what resets the count.
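To sanity-check the substitution before editing anything in place, a hedged sketch (sample.xml is a placeholder; dropping -i sends the output to stdout instead of back to the file):
# Preview what the replacement would produce for one file, without modifying it.
perl -pe 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$& . "_" . sprintf("%04d", ++$A)/ge' sample.xml | head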

Need to add text file entries to simple one line/command bash script

Basically, I need to execute a curl command multiple times and redirect the output to a .csv file; each time the command is executed, a term that is used in two separate places in the command changes. I do have a list of these terms (arguments?) contained in a separate text file. Each time the command runs for a different term, the output needs to be appended to the file.
The command is basically:
curl "http://someURL/standardconditions+AND+(TERM_exact+OR+TERM_related)" > testfile.csv
So each time the command is run, TERM changes in both places (TERM_exact and TERM_related). As I mentioned, I have a text file that lists all 60 or so terms; what I want is for the script to execute the command using the first term on the list, write the output to the specified .csv file, then repeat with the second term on the list, append that to the file, and so on until it has been run for every single term.
I imagine there is a simple way to do this, I'm just not sure how.
Here's one way to do it.
This assumes that listFile.csv is your list of 60 items, and that each line is a comma-separated pair of values (no commas allowed in the values!):
while IFS=, read -r exact related; do
    curl "http://someURL/standardconditions+AND+(TERM_${exact}+OR+TERM_${related})" >> testfile.csv
done < listFile.csv
It's not clear if you wanted one output file, or multiple.
You could replace the >> testfile.csv with >>testfile.${exact}_${related}.csv
to have separate files.
IHTH
You can set a variable to store TERM and use string concatenation to build a string like
"http://someURL/standardconditions+AND+(TERM_exact+OR+TERM_related)", then run a Python (or other language) script with a loop to handle the 60 terms.

Using cloc (count Lines of Codes) result

I am writing a script for my research, and I want to get the total number of lines in a source file. I came across cloc and I think I am going to use it in my script.
However, cloc's output contains too much information (unfortunately, since I am a new member I cannot upload a photo). It gives the number of files, number of lines, number of blank lines, number of comment lines, and other presentation details.
I am only interested in the number of lines, to use in my calculations. Is there a way to get that number easily (maybe via some command-line option, although I went through the available options and didn't find anything useful for my case)?
I thought of running a regular expression over the result to get the number; however, this is my first time using cloc and there might be a better/more professional way of doing it.
Any thought?
Regards,
Arwa
I am not sure about cloc, but it is worth using the default shell commands.
Please have a look at this question.
To get the number of lines of code per file:
find . -name '*.*' | xargs wc -l
To get the total number of lines of code in a directory:
(find ./ -name '*.*' -print0 | xargs -0 cat) | wc -l
Please note that if you need the number of lines from files with a specific extension, you could use *.ext (for example, *.rb if it is Ruby).
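For example, to total only Ruby files (a minor variation on the command above):
# Count lines across all .rb files under the current directory.
find . -name '*.rb' -print0 | xargs -0 cat | wc -l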
For something very quick and simple you could just use:
Dir.glob('your_directory/**/*.rb').map do |file|
  File.foreach(file).count
end.reduce(:+)
This will count all the lines of .rb files in your_directory and its subdirectories, although I would recommend adding some handling for blank lines as well as comment lines. For more, see Dir::glob.
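Comment handling is language specific, but a rough shell-side tweak that at least skips blank lines could look like this (same find pattern as above, assumed to suit your tree):
# Count only non-blank lines across all .rb files.
find . -name '*.rb' -print0 | xargs -0 cat | grep -cv '^[[:space:]]*$'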
@BinaryMee and @engineersmnky, thanks for your responses.
I tried two different solutions: one using "readlines", based on the answer from @gicappa to
Count the length (number of lines) of a CSV file?
and the other using cloc. I ran the command
%x{perl #{ClocPath} #{path-to-file} > result.txt}
and saved the result in result.txt
cloc returns its result as a formatted table (I cannot upload an image); it also reports the number of blank lines, comment lines, and code lines. As I said, I am interested in the code lines. So, I opened the file and used a regular expression to get the number I needed.
content = File.read("#{path}/result.txt")
line = content.scan(/(\s+\d+\s+\d+\s+\d+\s+\d+)/)
total = line[0][0].split(' ').last
content here will hold the contents of the file, and line will capture this line from it:
C# 1 3 3 17
C# is the language of the file, 1 is the number of files, 3 is the number of blank lines, 3 is the number of comment lines, and 17 is the number of code lines. I worked out the format from the cloc script itself. total will then hold the number 17.
This solution helps if you are reading a single file only; you will need to extend it if you are reading the lines of more than one file.
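Depending on the cloc version installed, a --csv option may also be available, which is easier to parse than the plain table; a hedged sketch (the file path is a placeholder, and the code count is assumed to be the last CSV column):
# Ask cloc for CSV output, skip the header row, and keep the last field
# (the code-line count) of the final data row.
cloc --csv path/to/file.cs | awk -F',' 'NR > 1 && NF > 1 { code = $NF } END { print code }'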
Hopefully this will help whoever needs it.
Regards,
Arwa

method for merging two files, opinion needed

Problem: I have two folders (one is the Delta Folder, where the files get updated, and the other is the Original Folder, where the original files exist). Every time a file updates in the Delta Folder I need to merge the file from the Original Folder with the updated file from the Delta Folder.
Note: though the file names in the Delta Folder and the Original Folder are unique, the content of the files may be different. For example:
$ cat Delta_Folder/1.properties
account.org.com.email=New-Email
account.value.range=True
$ cat Original_Folder/1.properties
account.org.com.email=Old-Email
account.value.range=False
range.list.type=String
currency.country=Sweden
Now, I need to merge Delta_Folder/1.properties with Original_Folder/1.properties so that my updated Original_Folder/1.properties will be:
account.org.com.email=New-Email
account.value.range=True
range.list.type=String
currency.country=Sweden
The solution I opted for is:
find all *.properties files in the Delta Folder and save the list to a temp file (delta-files.txt).
find all *.properties files in the Original Folder and save the list to a temp file (original-files.txt).
then I need to get the list of files that exist in both folders and put those in a loop.
then I need to loop over each file to read each line from a property file (1.properties).
then I need to read each line (delta-line="account.org.com.email=New-Email") from a property file in the Delta Folder and split the line on the delimiter "=" into two string variables
(delta-line-string1=account.org.com.email; delta-line-string2=New-Email;)
then I need to read each line (orig-line="account.org.com.email=Old-Email") from a property file in the Original Folder and split the line on the delimiter "=" into two string variables
(orig-line-string1=account.org.com.email; orig-line-string2=Old-Email;)
if delta-line-string1 == orig-line-string1 then update $orig-line with $delta-line
i.e:
if account.org.com.email == account.org.com.email then replace
account.org.com.email=Old-Email in original folder/1.properties with
account.org.com.email=New-Email
Once the loop finishes all the lines in a file, it moves on to the next file. The loop continues until it has processed every matched file in the folder.
For looping I used for loops, for splitting lines I used awk, and for replacing content I used sed.
Overall it's working fine, but it's taking a long time (4 minutes) to finish each file, because it goes through three loops for every line, splitting the line, finding the variable in the other file, and replacing the line.
I'm wondering if there is any way I can reduce the loops so that the script executes faster.
With paste and awk:
File 2:
$ cat /tmp/l2
account.org.com.email=Old-Email
account.value.range=False
currency.country=Sweden
range.list.type=String
File 1:
$ cat /tmp/l1
account.org.com.email=New-Email
account.value.range=True
The command + output:
paste /tmp/l2 /tmp/l1 | awk '{print $NF}'
account.org.com.email=New-Email
account.value.range=True
currency.country=Sweden
range.list.type=String
Or with a single awk command if sorting is not important:
awk -F'=' '{arr[$1]=$2}END{for (x in arr) {print x"="arr[x]}}' /tmp/l2 /tmp/l1
I think your two main options are:
Completely reimplement this in a more featureful language, like perl.
While reading the delta file, build up a sed script. For each line of the delta file, you want a sed instruction similar to:
s/account.org.com.email=.*$/account.org.com.email=value_from_delta_file/g
That way you don't loop through the original files a bunch of extra times. Don't forget to escape & / and \ as mentioned in this answer.
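A hedged sketch of that idea for a single pair of files, using the question's paths (the generated script name is arbitrary, GNU sed is assumed for -i, and values are assumed not to contain sed metacharacters such as & / \ or the | delimiter):
# Turn every delta line into one sed substitution keyed on the property name,
# then apply the whole script to the original file in a single pass.
while IFS='=' read -r key value; do
    printf 's|^%s=.*$|%s=%s|\n' "$key" "$key" "$value"
done < Delta_Folder/1.properties > /tmp/merge.sed

sed -i -f /tmp/merge.sed Original_Folder/1.properties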
Is using a database at all an option here?
Then you would only have to write code for extracting data from the Delta files (assuming that can't be replaced by a database connection).
It just seems like this is going to keep getting more complicated and slower as time goes on.

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5-7 digits in length). I have to run some queries on our databases based on those numbers, and don't want to go through the hundreds of entries weeding out the numbers one by one. What BASH commands can be used to parse the number out of each line (it's the only number in each line) and consolidate everything down to one line with all the numbers, comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around; there are entries such that, when saved as CSV, certain rows appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
With several other entries that look similar to any of the above rows. COMMENT can be replaced with a single string of (upper-case) characters no longer than 3 characters in length per COMMENT.
Assuming the variable is always on its own, and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here anything after DocumentId or fDocumentId will be captured. Works for the data you've presented so far, at least.
Even simpler than this :)
cat file.csv | cut -d "=" -f 2 | xargs
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes any dashes. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
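If the trailing comma matters, a hedged variant using grep and the same paste trick as above (GNU grep's -o is assumed; YOUR_ORIGINAL_FILE as in the previous answer):
# Pull out the digits that follow DocumentId= (with or without the leading f)
# and join them with commas, with no trailing separator.
grep -oE '[fF]?DocumentId=[0-9]+' YOUR_ORIGINAL_FILE | cut -d= -f2 | paste -sd","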
