I have a process that inserts data into PDFs that eventually loads into a system that gets searched based on that inserted data. The inserted data looks something like:
<<
/IBM-ODIndexes
<< /Private
<<
/DOB (05031983)
/FULL_NAME (TEST USER)
/YEAR (2020)
>>
/LastModified(D:20210112201530)
>>
However, there are instances where the data in the FULL_NAME field contains non UTF8 characters and then users are unable to search the data. Specifically apostrophes come over from Microsoft Word and then gets interpreted like this:
/FULL_NAME (JERRY OÃ<83>¢ââ<80><9a>‰â<80><9e>¢CONNELL)
In this case I am looking to strip out the apostrophe that is represented as Ã<83>¢ââ<80><9a>‰â<80><9e>¢ and replace it with a white space.
There are several complexities here, but in general I would say that the only reliable way to deal with it is to figure out the text encoding of the incoming document and converting it to the target encoding.
Ã<83>¢ââ<80><9a>‰â<80><9e>¢ is 34 characters (that is, at least 34 bytes), and no single encoding ever used that much space for a single character. What’s probably happening is multiple levels of encoding, such as HTML entities, base64, UTF-8/16/32 or escape characters like %% to represent % in SQL or \\ to represent \ in Bash. Reversing all these levels of encoding manually is going to involve quite a lot of reading the huge docx standard. The simpler alternative is to use a library which can just convert the entire text into a known character encoding for you, at which point you have to do at most a single conversion into UTF-8.
Another argument for this is that the “apostrophe string” does contain otherwise harmless characters like “a” and “e”. Without at least some understanding of the encodings you’re unlikely to be able to separate encoded characters from non-encoded ones, which would make the resulting text full of invalid text.
I'm currently strugling to clean csv files generated automatically with fields containing the csv separator and the field delimiter using sed or awk or via a script.
The source software has no settings to play with to improve the situation.
Format of the csv:
"111111";"text";"";"text with ; and " sometimes "; or ;" multiple times";"user";
Fortunately, the csv is "well" formatted, the exporting software just doesn't escape or replace "forbidden" chars from the fields.
In the last few days I tried to improve my knowledge of regular expression and find expression to clean the files but I failed.
What I managed to do so far:
RegEx to find the fields (I wanted to find the fields and perform a replace inside but I didn't find a way to do it)
(?:";"|^")(.*?)(?=";"|";\n)
RegEx that find semicolon, does not work if the semicolon is the last char of the field only find one per field.
(?:^"|";")(?:.*?)(;)(?:[^"\n].*?)(?=";"|";\n)
RegEx to find the double quotes, seems to pick the first double quote of the line in online regex testers
(?:^"|";")(?:.*?)[^;](")(?:[^;].*?)(?=";"|";\n)
I thought of adding space between each chars in the fields then searching for lonely semi colon and double quotes and remove single space after that but I don't know if it's even possible and seems like a poor solution anyway.
Any standard library should be able to handle it if there is no explicit error in the CSV itself. This is why we have quote-characters and escape characters.
When you create a CSV by yourself - you may forgot handling such cases and let your final output file use this situation. AWK is not a CSV reader but simply a text processing utility.
This is what your row should rather look like.
"111111";"text";"";"text with \; and \" sometimes \"; or ;\" multiple times";"user";
So if you can still re-fetch the data, find a way to export the CSV either through the database's own functionality of csv library for the languages you work with.
In python, this would look like this:-
mywriter = csv.writer(csvfile, delimiter=';', quotechar='"', escapechar="\\")
But if you can't create csv again, the only hope is that you expect some pattern within the fields, as in this question:- parse a csv file that contains commans in the fields with awk
But this is rarely true in textual data - esp comments or posts on a webpage. Another idea in such situations would be to use '\t' as separator.
Is there any character that is guaranteed not to appear in any file path on Windows or Unix/Linux/OS X?
I need this because I want to join together a few file paths into a single string, and then split them apart again later.
In the comments, Harry Johnston writes:
The generic solution to this class of problem is to encode the file paths before joining them. For example, if you're dealing with single-byte strings, you could convert them to hex strings; so "hello" becomes "68656c6c6f". (Obviously that isn't the most efficient solution!)
That is absolutely correct. Please don't try to do anything "tricky" with filenames and reserved characters, because it will eventually break in some weird corner case and your successor will have a heck of a time trying to repair the damage.
In fact, if you're trying to be portable, I strongly recommend that you never attempt to create any filenames including any characters other than [a-z0-9_]. (Consider that common filesystems on both Windows and OS X can operate in case-insensitive mode, where FooBar.txt and FOOBAR.TXT are the same identifier.)
A decently compact encoding scheme for practical use would be to make a "whitelisted set" such as [a-z0-9_], and encode any character ch outside your "whitelisted set" as printf("_%2x", ch). So hello.txt becomes hello_2etxt, and hello_world.txt becomes hello_5fworld_2etxt.
Since every _ is escaped, you can use double-_ as a separator: the encoded string hello_2etxt__goodbye___2e_2e uniquely identifies the list of filenames ['hello.txt', 'goodbye', '..'].
You can use a newline character, or specifically CR (decimal code 13) or LF (decimal code 10) if you like. Whether this is suitable or not depends on what requirements you have with regard to displaying the concatenated string to the user - with this approach, it will print its parts on separate lines - which may be very good or very bad for the purpose (or you may not care...).
If you need the concatenated string to print on a single line, edit your question to specify this additional requirement; and we can go from there then.
For some reason I need to fetch emails from /var/mail/username file. It seems like an append only file.
My question is, is it safe to parse the content of the /var/email/username file depending on the first line From username#host Mon Jun 20 16:50:15 2011? What if the similar pattern found inside the email body?
Furthermore, is there any opensource ruby script available for reference?
Yes, that seems like more or less the right way to parse the mbox format - from a quick scan of the RFC specification:
The structure of the separator lines
vary across implementations, but
usually contain the exact character
sequence of "From", followed by a
single Space character (0x20), an
email address of some kind, another
Space character, a timestamp sequence
of some kind, and an end-of- line
marker.
And...
Many implementations are also known
to escape message body lines that
begin with the character sequence of
"From ", so as to prevent confusion
with overly-liberal parsers that do
not search for full separator
lines. In the common case, a leading
Greater-Than symbol (0x3E) is used
for this purpose (with "From "
becoming ">From "). However, other
implementations are known not to
escape such lines unless they are
immediately preceded by a blank line
or if they also appear to contain
an email address and a timestamp.
Other implementations are also
known to perform secondary escapes
against these lines if they are
already escaped or quoted, while
others ignore these mechanisms
altogether.
Update:
There's also this: https://github.com/meh/ruby-mbox
For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. The roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~ you could also combine two characters
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV style format? Characters can be escaped in a standard CSV format, and there's already a lot of parsers already written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
say you want to concatinate str1, str2 and str3
what I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
note: the order of the replace is important
its unbreakable and easy to implement
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep mind mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply seperate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non printable char works if your not hand creating or modifying the file. For quick random access file storage and retrieval field width is used. You don't even have to read the file.. your literally pulling from the file by reference. This is how databases do some storage.. but they also manage the spaces between records and such. And it introduced the problem of max data element width. (Index attach a header which is used to define the width of each element and it's data type in the original old days.. later they introduced compression with remapping chars. This allows for a text file to get about 1/8 the size in transmission.. variable length char encoding for the win
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor