Display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5 to 7 digits in length). I have to run some queries on our databases based on those numbers, and I don't want to go through the hundreds of entries weeding out the numbers one by one. What BASH commands can be used to parse out the number from each line (it's the only number in each line) and consolidate everything down to one line, with all the numbers comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional, as they decided to simply reduce it down to the variable name and value after a few rows. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: my apologies, I did not notice that halfway through the list they decided to get froggy and change things around; there are entries that, when saved as CSV, appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
With several other entries that look similar to any of the above rows. Each COMMENT is a single string of upper-case characters no longer than 3 characters.

Assuming the variable is always on its own, and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here the digits after DocumentId= (or fDocumentId=) are captured. It works for the data you've presented so far, at least.
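Since the value is the only run of digits on each line (5 to 7 digits per the question), a grep-based sketch should work as well; this assumes GNU grep and reuses the testdata file name from above:
grep -oE '[0-9]{5,7}' testdata | paste -sd","  # print each digit run on its own line, then join with commas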

Simpler than that :)
cat file.csv | cut -d "=" -f 2 | xargs
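Note that xargs joins the values with spaces; to get the comma-separated form asked for, the same cut can feed paste instead (a sketch that, like the line above, keeps any trailing comment text from the later rows):
cut -d "=" -f 2 file.csv | paste -sd","  # join the extracted values with commas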

If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne 's/.*=//; s/ .*//; s/-//; chomp; print "$_,"' < YOUR_ORIGINAL_FILE
That cuts everything up to and including an =, then everything after a space, then removes a dash if present. Run on the above input, it returns:
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
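As written, the output ends with a trailing comma and no newline. A variation that collects the values and joins them instead (a sketch using the same three substitutions and the same placeholder file name):
perl -ne 's/.*=//; s/ .*//; s/-//; chomp; push @n, $_; END { print join(",", @n), "\n" }' < YOUR_ORIGINAL_FILE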

Related

Appending a count to a code in multiple files and saving the result

I'm looking for a bit of help here. I'm a complete newbie!
I need to look in a file for a code matching the pattern A00000_00_A and append a count to it, so the first time it appears it is replaced with A00000_00_A_001, second time A00000_00_A_002 etc. The output needs to be written back to the same file. Each file only contains 1 code, but it appears multiple times.
After some digging I have found:
perl -pi -e 's/Q\d{4,5}'_'\d{2}_./$&.'_'.++$A /ge' /users/documents/*.xml
but the issue is that the counter does not reset in each file.
That is, the output of the first file is, say, Q00390_01_A_1 to Q00390_01_A_7, while the second file is Q00391_01_A_8 to Q00391_01_A_10.
What I want is Q00390_01_A_1 to Q00390_01_A_7 in the first file and Q00391_01_A_1 to Q00391_01_A_2 in the second.
Does anyone have any idea how to edit the above code to make it do that? I'm a total newbie, so ideally an edit to what I have would be brilliant. Thanks
cd /users/documents/
for f in *.xml; do
    perl -pi -e 's/facs=.(Q|M)\d{4,5}_\d{2}_\w/$&._.sprintf("%04d",++$A)/ge' "$f"
done
This matches the string facs= and any character, then "Q" or "M" followed by four or five digits, then an underscore, two digits, another underscore, and a word character. The entire match is then concatenated with an underscore and the value of $A, zero-padded to four digits. Because perl is invoked once per file inside the loop, $A starts over for each file, so the counter resets as required.

grep listing false duplicates

I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat" to find duplicates, which is formatted in the same way:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note: both files are big, so the samples shown above don't contain duplicates, but I do expect, and have confirmed, at least 10-12k duplicates in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results were fine, so I'm a little confused about why it is failing on the larger run.
I also tried feeding the patterns in as fixed strings with -F, thinking that the "," comma might be the source of the issue. Right now I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the whole line exactly, while -w matches the pattern only as a whole word (the match cannot be preceded or followed by a word character), which works in my case to handle the trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
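Concretely, the fixed-string, whole-word version of the command would look something like this (a sketch reusing the file names from the question):
grep -wFf pilot.dat disps.dat > duplicate.dat  # -F fixed strings, -w whole-word matches, -f patterns from pilot.dat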

Find variety of characters in text document

I have a CSV document with 47001 lines in it. Yet when I open it in Excel, there are only 31641 lines.
I know that 47001 is the correct number of lines; it's an export of a database table, whose size I know to be 47001. Additionally: wc -l my.csv returns 47001.
So, Excel's parsing fails. I suspect there is some funky control or whitespace character somewhere in this document.
How do I find out the variety of characters used in some document?
For example, consider this input file: ABCAAAaaa\n.
I would expect the alphabet of characters used in the file to be: ABCa\n.
Maybe if we compress it, we can somehow read the Huffman Tree?
I suspect it will be educational to compare the UTF-8 character variety versus the ASCII character variety. For example: Excel may parse multi-byte characters in ASCII, and thus interpret some bytes as control codepoints.
Here we go if you are on Linux (the logic would be the same elsewhere, but the command below is for Linux):
sed 's/./&\n/g' my.csv | sort -u | tr -d '\n'
What happens:
- First, replace each character with itself followed by "\n" (a newline)
- Then sort all the characters and keep only the unique occurrences
- Finally, remove all the "\n"
Then the input file:
ABCAAAaaa
becomes:
A
B
C
A
A
A
a
a
a
After sorting:
a
a
a
A
A
A
A
B
C
Then after keeping only unique occurrences (the -u):
A
B
C
a
Final output, after removing the newlines:
aABC
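An alternative sketch with the same goal, assuming GNU coreutils and mostly ASCII data in my.csv (the file from the question): split the file into one character per line, de-duplicate, and render non-printing characters visibly so a stray control character stands out:
fold -w1 my.csv | sort -u | cat -A  # cat -A shows tabs as ^I, line ends as $, and other non-printing bytes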
You can cut a few columns out of the original file, ones that are not likely to be changed by the cycle of being parsed and written out again, e.g. a plain text column like a name, or a number. Names would be great. Then let this reduced file pass through the cycle and compare it to the original.
Here's the code:
cut -d, -f3,6,8 my.csv > columns.csv
This assumes that columns 3, 6, and 8 are the name columns and that a comma is the separator. Adjust these values according to your input file. Using a single column is also okay.
Now open Excel, parse the file columns.csv, and write it out again as a CSV file columns2.csv (with the same separator, of course). Then:
diff columns.csv columns2.csv | less
A tool like meld instead of diff might also be handy for analysing the differences.
This will show you which lines were changed by the parse-and-dump cycle. Hopefully it will affect only the lines you are looking for.
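If you just want a quick count of how many lines the cycle altered, a small follow-up sketch with the same file names:
diff columns.csv columns2.csv | grep -c '^<'  # counts lines of columns.csv that came back changed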

Extracting lines with specific character count

I have a python script that is pulling URLs from pastebin.com/archive, which has links to pastes (which have 8 random characters after pastebin.com in the URL). My current output is a .txt file with the data below in it. I only want the links to pastes (example: http://pastebin.com///Y5JhyKQT), not links to other pages such as pastebin.com/tools. This is so I can set wget to go pull each individual paste.
The only way I can think of doing this is writing a bash script to count the number of characters in each line and only keep lines with exactly 30 characters (the length of the URLs linking to pastes).
I have no idea how I'd go about implementing something like this using grep or awk, perhaps using a while loop? Any help would be appreciated!
http://pastebin.com///tools
http://pastebin.com//top.location.href
http://pastebin.com///trends
http://pastebin.com///Y5JhyKQT <<< I want to keep this
http://pastebin.com//=
http://pastebin.com///>
From the sample you posted it looks like all you need is:
grep -E '/[[:alnum:]]{8}$' file
or maybe:
grep -E '^.{30}$' file
If that doesn't work for you, explain why and provide a better sample.
This is the algorithm:
Find all characters between newline characters, or read one line at a time.
Count them, or store the line in a variable and get its length. This is the length of your line.
Only process the lines whose length is exactly the count you want.
Python has functions for both reading a line and getting the character count of a string.
#!/usr/bin/env zsh
# keep only the lines that are exactly 30 characters long
while read aline; do
    if [[ ${#aline} == 30 ]]; then
        # do something with the matching line, e.g. print it
        print -r -- "$aline"
    fi
done < file
This is documented in the bash man pages under the "Parameter Expansion" section.
EDIT: this solution is zsh-only.
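Since the question mentions grep or awk: an awk one-liner does the same length check without an explicit loop (a sketch, assuming the input file is named file as in the first answer):
awk 'length($0) == 30' file  # print only the lines that are exactly 30 characters long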

How can I get only specific strings (by condition) from a file?

I have a huge text file with strings in a special format. How can I quickly create another file containing only the strings that match my condition?
for example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the entries with "myRule" and "cat"?
I think it should be perl or bash, but it doesn't matter.
Thanks a lot, and sorry for the noob question.
Is it correct that each entry is two lines long? Then you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
The first rule appends the next line to the pattern space when myRule matches.
The second rule tries to match myRule followed by cat in the (two-line) pattern space; if found, it prints the pattern space.
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern systems), then the only way is to either change the record separator or read the lines in pairs.
This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline:
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second:
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
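For completeness, an awk sketch of the same read-in-pairs idea (assuming, as above, that each entry is exactly two lines, and reusing the huge_file.txt name):
awk '/rule="myRule"/ { prev = $0; next } prev && /cat/ { print prev; print } { prev = "" }' huge_file.txt  # remember a myRule line, print the pair when the next line mentions cat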
