Find variety of characters in text document - bash

I have a CSV document with 47001 lines in it. Yet when I open it in Excel, there are only 31641 lines.
I know that 47001 is the correct number of lines; it's an export of a database table, whose size I know to be 47001. Additionally: wc -l my.csv returns 47001.
So, Excel's parsing fails. I suspect there is some funky control or whitespace character somewhere in this document.
How do I find out the variety of characters used in some document?
For example, consider this input file: ABCAAAaaa\n.
I would expect the alphabet of characters used in the file to be: ABCa\n.
Maybe if we compress it, we can somehow read the Huffman Tree?
I suspect it will be educational to compare the UTF-8 character variety with the raw byte (ASCII) variety. For example, Excel might parse a multi-byte UTF-8 character as individual ASCII bytes, and thus interpret some of those bytes as control codepoints.

Here we go if you are on Linux (the logic would be the same elsewhere, but the command below is for Linux):
sed 's/./&\n/g' my.csv | sort -u | tr -d '\n'
What happens:
- First, replace each character with that character followed by "\n" (a newline)
- Then sort all the characters and keep only the unique occurrences
- Finally, remove all the "\n"
So an input file containing:
ABCAAAaaa
becomes:
A
B
C
A
A
A
a
a
a
After sort:
a
a
a
A
A
A
A
B
C
Then after uniq:
a
A
B
C
Final output:
aABC
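To actually hunt down a stray control character, it can also help to count how often each character occurs and to inspect the raw bytes. A small sketch, assuming GNU coreutils and the my.csv file name from the question:
# count occurrences of every character; rare entries in the output are good suspects
fold -w1 my.csv | sort | uniq -c | sort -rn
# dump the raw bytes; control characters show up as escapes such as \r or \0
od -c my.csv | less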

You can cut out of the original file some columns which are not likely to be changed by the cycle of being parsed and written out again, e.g. a pure text column like a name, or a number. Names would be great. Then let this file pass through the cycle and compare it to the original.
Here's the code:
cut -d, -f3,6,8 my.csv > columns.csv
This assumes that columns 3, 6, and 8 are the name columns and that a comma is the separator. Adjust these values according to your input file. Using a single column is also okay.
Now call Excel, parse the file columns.csv, write it out again as a csv file columns2.csv (with the same separator of course). Then:
diff columns.csv columns2.csv | less
A tool like meld instead of diff might also be handy to analyse the differences.
This will show you which lines were changed by the parse→dump cycle. Hopefully only the lines you are looking for are affected.
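If you only want to know where the first discrepancy is, cmp reports the first differing byte and line number; a minimal sketch reusing the file names from above:
cmp columns.csv columns2.csv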

Related

sort -o appends newline to end of file - why?

I'm working on a small text file with a list of words in it that I want to add a new word to, and then sort. The file doesn't have a newline at the end when I start, but does after the sort. Why? Can I avoid this behavior or is there a way to strip the newline back out?
Example:
words.txt looks like
apple
cookie
salmon
I then run printf "\norange" >> words.txt; sort words.txt -o words.txt
I use printf rather than echo, figuring that'll avoid the newline, but the file then reads
apple
cookie
orange
salmon
#newline here
If I just run printf "\norange" >> words.txt, orange appears at the bottom of the file with no newline, i.e.:
apple
cookie
salmon
orange
This behavior is explicitly defined in the POSIX specification for sort:
The input files shall be text files, except that the sort utility shall add a newline to the end of a file ending with an incomplete last line.
As a UNIX "text file" is only valid if all lines end in newlines, as also defined in the POSIX standard:
Text file - A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the newline character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.
Think about what you are asking sort to do.
You are asking it "take all the lines, and sort them in order."
You've given it a file containing four lines, which it splits into the following strings:
"apple\n"
"cookie\n"
"salmon\n"
"orange"
It sorts these for you dutifully:
"cookie\n"
"orange"
"salmon\n"
And it then outputs them as a single string:
"cookie
orangesalmon
"
That is almost certainly exactly what you do not want.
So instead, if your file is missing the terminating newline that it should have had, the sort program understands that, most likely, you still intended that last line to be a line, rather than just a fragment of a line. It appends a \n to the string "orange", making it "orange\n". Then it can be sorted properly, without "orange" getting concatenated with whatever line happens to come immediately after it:
"cookie\n"
"orange\n"
"salmon\n"
So when it then outputs them as a single string, it looks a lot better:
"cookie
orange
salmon
"
You could strip the last character off the file, the one from the end of "salmon\n", using a range of handy tools such as awk, sed, perl, php, or even raw bash. This is covered elsewhere, in places like:
How can I remove the last character of a file in unix?
But please don't do that. You'll just cause problems for all other utilities that have to handle your files, like sort. And if you assume that there is no terminating newline in your files, then you will make your code brittle: any part of the toolchain which "fixes" your error (as sort kinda does here) will "break" your code.
Instead, treat text files the way they are meant to be treated in unix: a sequence of "lines" (strings of zero or more non-newline bytes), each followed by a newline.
So newlines are line-terminators, not line-separators.
There is a coding style where prints and echos are done with the newline leading. This is wrong for many reasons, including creating malformed text files, and causing the output of the program to be concatenated with the command prompt. printf "orange\n" is correct style, and also more readable: at a glance someone maintaining your code can tell you're printing the word "orange" and a newline, whereas printf "\norange" looks at first glance like it's printing a backslash and the phrase "no range" with a missing space.
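Putting that advice into practice for the example in the question, a minimal sketch (it assumes words.txt already ends in a newline, as a well-formed text file should):
printf 'orange\n' >> words.txt   # append a complete, newline-terminated line
sort -o words.txt words.txt      # sort in place; nothing needs repairing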

Slice keywords from log text files

I have a big log file with lines such as
[2016-06-03T10:03:12] No data: TW.WA2
or
[2016-06-03T11:03:02] wrong overlaps: XW.W12.HHZ.2007.289
and as
[2016-06-03T14:05:26] failed to correct YP.CT02.HHZ.2012.334 because No matching response.
Each line consists of a timestamp, a reason for the logging and a keyword composed of some substrings connected by dots (TW.WA2, XW.W12.HHZ.2007.289 and YP.CT02.HHZ.2012.334 in above examples).
The format of the keywords of a specific type is fixed (the substrings are joined by a fixed number of dots).
The substrings are composed of letters and digits (0-5 characters each; not every substring can be empty, and generally at most one is, e.g., XW.WTA12..2007.289).
I want to
- extract the keywords
- save each type of keyword, deduplicated, to a separate file
Currently I have tried grep, but it only does the classification:
grep "wrong overlaps" logfile > wrong_overlaps
grep "failed to correct" logfile > no_resp
grep "No data" logfile > no_data
In no_data, the contents are expected as like
AW.AA1
TW.WA2
TW.WA3
...
In no_resp, the contents are expected as like
XP..HHZ.2002.334
YP.CT01.HHZ.2012.330
YP.CT02.HHZ.2012.334
...
However, the simple grep commands above save the full lines. I guess I need regex to extract the keywords?
Assuming a keyword is defined as groups of letters and digits joined by periods, the following regex will match all the keywords:
% grep -oE '\w+(\.\w+)+' data
TW.WA2
XW.W12.HHZ.2007.289
YP.CT02.HHZ.2012.334
-o prints only the matched parts, and -E enables Extended Regular Expressions.
This will, however, not split the results into multiple files, e.g. creating a file wrong_overlaps that contains only the keywords from lines with wrong overlaps.
You can use -P to enable Perl Compatible Regular Expressions which support lookbehinds:
% grep -oP '(?<=wrong overlaps: )\w+(\.\w+)+' data
XW.W12.HHZ.2007.289
But note that PCRE doesn't support variable-length lookbehinds, so you will need to type out the full literal prefix inside the lookbehind. E.g., given:
something test string: ABC.DEF
ABC.DEF can be extracted with:
(?<=test string: )\w+(\.\w+)+
but not with:
(?<=test string)\w+(\.\w+)+
because the lookbehind must spell out everything up to the keyword, and a variable-length pattern cannot be used to bridge the gap.
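Putting it together for the three categories in the question, here is a sketch assuming GNU grep with -P support and a log file named logfile; the slightly looser \w+(\.\w*)+ also matches keywords with an empty substring, such as XP..HHZ.2002.334:
grep -oP '(?<=No data: )\w+(\.\w*)+' logfile | sort -u > no_data
grep -oP '(?<=wrong overlaps: )\w+(\.\w*)+' logfile | sort -u > wrong_overlaps
grep -oP '(?<=failed to correct )\w+(\.\w*)+' logfile | sort -u > no_resp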

How can I get only specific strings (by condition) from a file?

I have a huge text file with strings of a special format. How can I quickly create another file containing only the strings that match my condition?
For example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the strings with "myRule" and "cat".
I think it should be perl or bash, but it doesn't matter.
Thanks a lot, and sorry for the noob question.
Is it correct that each entry is two lines long? Then you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
The first rule appends the next line to the pattern space when myRule matches.
The second rule tries to match myRule followed by cat in the pattern space; if it matches, the pattern space is printed.
If your file is so huge that it won't fit in memory (although files up to a few gigabytes are fine on modern systems), then the only way is to either change the record separator or to read the lines in pairs.
This shows the first way; it assumes that the second line of every pair ends with a double quote followed by a newline:
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
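For completeness, the same pair-wise idea can be sketched in awk; it assumes every entry is exactly two lines long, with huge_file.txt as the same placeholder file name:
awk '{getline second; pair = $0 "\n" second; if (pair ~ /myRule/ && pair ~ /cat/) print pair}' huge_file.txt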

Split text file into multiple files

I have a large text file containing 1000 abstracts, with an empty line between each abstract. I want to split this file into 1000 text files.
My file looks like
16503654 Three-dimensional structure of neuropeptide k bound to dodecylphosphocholine micelles. Neuropeptide K (NPK), an N-terminally extended form of neurokinin A (NKA), represents the most potent and longest lasting vasodepressor and cardiomodulatory tachykinin reported thus far.
16504520 Computer-aided analysis of the interactions of glutamine synthetase with its inhibitors. Mechanism of inhibition of glutamine synthetase (EC 6.3.1.2; GS) by phosphinothricin and its analogues was studied in some detail using molecular modeling methods.
You can use split and set "NUMBER lines per output file" to 2. Each file would have one text line and one empty line.
split -l 2 file
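If you would like friendlier file names, GNU split can also number the pieces and add an extension; a sketch, where the abstract_ prefix and .txt suffix are purely illustrative:
split -l 2 -d -a 4 --additional-suffix=.txt file abstract_   # abstract_0000.txt, abstract_0001.txt, ...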
Something like this:
awk 'NF{print > $1;close($1);}' file
This will create 1000 files, each named after the abstract number. The awk code writes each record to a file whose name is taken from the first field ($1). This is done only when the number of fields is greater than zero (NF).
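If an abstract can ever wrap over more than one line, a variation using awk's paragraph mode (RS="") treats each blank-line-separated block as one record; a sketch, again naming each output file after the abstract number, with a purely illustrative .txt extension:
awk 'BEGIN{RS=""} {out = $1 ".txt"; print > out; close(out)}' file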
You could always use the csplit command, which is a file splitter driven by a regex.
Something along the lines of:
csplit -ks -f /tmp/files INPUTFILENAMEGOESHERE '/^$/' '{*}'
It is untested and may need a little tweaking though.
CSPLIT

display consolidated list of numbers from a CSV using BASH

I was sent a large list of URLs in an Excel spreadsheet, each unique according to a certain GET variable in the string (whose value is a number 5-7 digits long). I have to run some queries on our databases based on those numbers, and don't want to go through the hundreds of entries weeding the numbers out one by one. What BASH commands can be used to parse out the number from each line (it's the only number in each line) and consolidate everything down to one line with all the numbers, comma separated?
A sample (shortened) listing of the CSV spreadsheet includes:
http://www.domain.com/view.php?fDocumentId=123456
http://www.domain.com/view.php?fDocumentId=223456
http://www.domain.com/view.php?fDocumentId=323456
http://www.domain.com/view.php?fDocumentId=423456
DocumentId=523456
DocumentId=623456
DocumentId=723456
DocumentId=823456
....
...
The change of format was intentional: after a few rows they decided to simply reduce it down to the variable name and value. The change of the GET variable from fDocumentId to just DocumentId was also intentional. Ideal output would look similar to:
123456,223456,323456,423456,523456,623456,723456,823456
EDIT: My apologies, I did not notice that halfway through the list they decided to get froggy and change things around; there are entries that, when saved as CSV, appear as:
"DocumentId=098765 COMMENT, COMMENT"
DocumentId=898765 COMMENT
DocumentId=798765- COMMENT
"DocumentId=698765- COMMENT, COMMENT"
There are several other entries that look similar to any of the above rows. Each COMMENT stands for a single string of upper-case characters no longer than 3 characters.
Assuming the variable is always on its own and last on the line, how about just taking whatever is to the right of the =?
sed -r "s/.*=([0-9]+)$/\1/" testdata | paste -sd","
EDIT: Ok, with the new information, you'll have to edit the regex a bit:
sed -r "s/.*f?DocumentId=([0-9]+).*/\1/" testdata | paste -sd","
Here the number after DocumentId or fDocumentId is captured and everything else is dropped. It works for the data you've presented so far, at least.
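Another way to cope with both formats at once is grep with a lookbehind; a sketch, assuming GNU grep with -P support and file.csv as a placeholder file name (the lookbehind (?<=DocumentId=) also matches fDocumentId=, since it only checks the text immediately before the digits):
grep -oP '(?<=DocumentId=)[0-9]+' file.csv | paste -sd","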
Simpler still :)
cat file.csv | cut -d "=" -f 2 | xargs
If you're not completely committed to bash, the Swiss Army Chainsaw will help:
perl -ne '{$_=~s/.*=//; $_=~s/ .*//; $_=~s/-//; chomp $_ ; print "$_," }' < YOUR_ORIGINAL_FILE
That removes everything up to and including an =, then everything from the first space onward, then a dash if present. Run on the above input, it returns
123456,223456,323456,423456,523456,623456,723456,823456,098765,898765,798765,698765,
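If the trailing comma is a problem, one sketch is to let the Perl one-liner emit one number per line and join the lines with paste, as in the earlier answers:
perl -ne 's/.*=//; s/[- ].*//; print' YOUR_ORIGINAL_FILE | paste -sd","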
