Are there any good sites/services to validate the consistency of a CSV file?
Something like the W3C validator, but for CSV?
Thanks!
I recently came across Google Refine (now OpenRefine). It's not a service for validating CSV files but a tool you download locally; still, it provides a lot of tools for working with data and detecting anomalies.
As mentioned in a reply, "CSV" has become an ill-defined term, principally because people don't follow the One True Way when using delimiter-separated data:
http://www.catb.org/~esr/writings/taoup/html/ch05s02.html
EDIT/UPDATE (2016-08-09):
CSV is currently becoming a well-defined term through the work of the W3C CSV Working Group.
The Open Data Institute is developing a CSV validation service that will allow users to check the structure of their data as well as validate it against a simple schema.
The service is still very much in alpha but can be found here:
http://csvlint.io/
The code for the application and the underlying library are both open source:
https://github.com/theodi/csvlint
https://github.com/theodi/csvlint.rb
The README in the library provides a summary of the errors and warnings that can be generated. The following types of error can be reported:
:wrong_content_type -- content type is not text/csv
:ragged_rows -- row has a different number of columns (than the first row in the file)
:blank_rows -- completely empty row, e.g. blank line or a line where all column values are empty
:invalid_encoding -- encoding error when parsing row, e.g. because of invalid characters
:not_found -- HTTP 404 error when retrieving the data
:quoting -- problem with quoting, e.g. missing or stray quote, unclosed quoted field
:whitespace -- a quoted column has leading or trailing whitespace
The following types of warning can be reported:
:no_encoding -- the Content-Type header returned in the HTTP request does not have a charset parameter
:encoding -- the character set is not UTF-8
:no_content_type -- file is being served without a Content-Type header
:excel -- no Content-Type header and the file extension is .xls
:check_options -- CSV file appears to contain only a single column
:inconsistent_values -- inconsistent values in the same column. Reported if <90% of values seem to have same data type (either numeric or alphanumeric including punctuation)
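If you want to implement a couple of these checks yourself, here is a minimal sketch in Python of the ragged_rows and blank_rows checks described above (an illustration only, not the csvlint library itself; the file name is hypothetical):

import csv

def check_csv(path):
    # Report ragged and blank rows, two of the checks listed above.
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        for lineno, row in enumerate(reader, start=2):
            if all(cell.strip() == "" for cell in row):
                problems.append((lineno, "blank_rows"))
            elif len(row) != len(header):
                problems.append((lineno, "ragged_rows"))
    return problems

for lineno, error in check_csv("data.csv"):
    print(f"line {lineno}: {error}")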
The National Archives developed a CSV Schema Language and a CSV Validator, written in Java. Both are open source.
To validate a CSV file, I use the Rainbow CSV extension in Visual Studio Code, and I also open the file in Excel.
There is a great way to validate your CSV file. I am referring to this article, where the whole process is explained in detail.
The validation process has two steps: first, post the file to the API; once your file is accepted, the API returns a polling endpoint that contains the results of the validation process. There is a 10 MB limit per file.
CSV Lint at csvlint.com (not .io :) is a service we're building to solve this problem. It checks CSV files against user-defined validation rules / schemas cell by cell.
We spent a lot of time tweaking the UI so users can easily create complex validation rules / schemas that meet their business needs without a single line of code.
Our offline validation feature lets users see results in real time even when validating multiple large files (millions of rows or more), and, most importantly, it fully protects user data privacy.
Toolkit Bay CSV Validator & Linter is online and easy to use: set the delimiter and go.
The Flatfile CSV validator online demo offers automatic delimiter detection: upload and go.
I know what JSON is and what its advantages over XML are. I have already read some answers about this, but I still can't get through it.
So I would specifically ask these questions:
1. Is it only useful for APIs, i.e. exchanging data without refreshing the whole page, using AJAX?
2. Is it always used with AJAX?
3. Do people (always/very often) use JSON like this: Database/Server - JSON - Client? What I mean is, all our data from the database is put into JSON so people can easily use it on any other platform/language?
Because from my point of view, if the data we need to output isn't much, why not just write it directly into the HTML, and if it's a lot of data, why not use a database? If you don't mind, please add an example use case for JSON.
Big thanks, everyone!
Because JSON is a lightweight data-interchange format, its uses vary widely. You describe using it for an API, which would be an ideal situation for JSON output over something like XML.
To specifically answer your questions:
It's not just useful for an API. It can be used for configuration (for example, Composer's JSON configuration file). It can also be used for basic output that is easy to read in languages like JavaScript, since JSON is native to JavaScript as an object (JavaScript Object Notation).
It's not always used with AJAX. Say you were building a PHP application to convert currency and wanted to read from an API that outputs JSON; because languages like PHP can encode and decode JSON, you could read from the API (or another source) and decode it, giving you a PHP object or array of the JSON data.
I think you mean reading from a database, outputting that in JSON format and then allowing clients to read it using an API. In this case, it's not always the way it's used - but if I had to guess, it's the most common way it's used, and probably most useful.
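To make the decode step concrete, here is a minimal sketch in Python; the answer above talks about PHP, but the idea is the same in any language (the URL and response fields are hypothetical):

import json
from urllib.request import urlopen

# Hypothetical currency API returning JSON such as:
# {"base": "USD", "rates": {"EUR": 0.92, "GBP": 0.79}}
with urlopen("https://api.example.com/rates?base=USD") as response:
    data = json.load(response)  # decode JSON into native dicts/lists

print(data["rates"]["EUR"])  # access it like any other object

# Encoding works the other way around:
payload = json.dumps({"base": "USD", "symbols": ["EUR", "GBP"]})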
In my opinion, when you get some data from the network, you can use JSON to describe that data. JSON is only a data format; it isn't always used with AJAX. It supports arrays and dictionaries.
I'm currently working on transforming an XML file into a delimiter-separated file, and I was pondering how to represent multiple values of an attribute field. My current idea is to represent the values as below:
First Name;Last Name;E-mail id;Description
Fresher;user1;"|email1#abc.com|;|email2#abc.com|";This user joined as fresher.
My question is: is there a standard for representing multiple values?
How is this scenario handled by common spreadsheet programs such as Microsoft Excel, OpenOffice Calc, and Lotus 1-2-3 when importing a .csv file?
Based on this, I want to make changes to my XSLT code.
I'd appreciate any help in this regard.
In my experience it is always good to stick to database normalisation standards. There is a lot of information on the web for further reference.
a) Looking at your proposal, what I like is separating each column with a semicolon instead of a comma. It makes the data easier to import into other systems later, especially when you deal with different national conventions for number separators.
b) What I don't like, however, is the e-mail section. There are problems in the following areas:
Quotation marks are a problem; try to avoid them.
Don't separate e-mail addresses inside a field with the same mark used for column separation, so you shouldn't use a semicolon there (I assume each record can have one or more e-mails).
If you can't introduce database normalisation standards I would propose the following small improvements to your idea:
Fresher;user1;email1#abc.com|email2#abc.com;This user joined as fresher
If you provide that kind of data file, I think any VBA user would be able to import it into Excel (or any other system) easily and quickly.
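To show how that layout parses, here is a minimal sketch in Python (the file name is hypothetical; the column order follows the sample above):

import csv

# Sample row, as proposed above:
# Fresher;user1;email1#abc.com|email2#abc.com;This user joined as fresher
with open("users.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=";")
    header = next(reader)  # First Name;Last Name;E-mail id;Description
    for first, last, emails, description in reader:
        # The multi-valued field splits on '|', which never collides
        # with the ';' column separator.
        email_list = emails.split("|")
        print(first, last, email_list, description)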
I'm looking for general UI advice on importing a CSV file. The UI is done in ASP.NET MVC3.
When the user uploads the file, I need to validate it and allow them to manually correct any errors in the browser before I store it in the database. There are so many potential errors to check for, and I'm really not sure of the best way to achieve this. Another constraint is that I only have a few days to implement it, so it can't be too complicated. I'm fine with regular expressions and programming, and I already have the posted file stream available, but I just can't think of a good, practical way to present this functionality to the user.
Hope someone can inspire me. Many thanks.
There are some suggestions here:
Reading a CSV file in .NET?
Of these, we chose to use Linq2CSV in our MVC projects.
http://www.codeproject.com/KB/linq/LINQtoCSV.aspx
It is fairly easy to use, and the validation is nice. You define a simple class that lays out the structure (columns) of the CSV file. It does basic validation, and if that passed, we sent the data through a validator that used DataAnnotation attributes to check more complex rules. We found it reliable, and we were able to add some features we wanted.
If the file was pathologically bad, we'd fail the whole thing and present a single error message. If the file was reasonably sound, we would display the rows in error along with the error messages for the row so they could see the problem in context. In our case, this was a display grid only - we did not allow editing through the website - because the CSVs were being generated out of their data system, and we needed them to edit the source data in their system and regenerate the CSV. To do in place editing, you would need to stage all the column values as strings so they can fix numbers that don't parse, etc.
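Linq2CSV is .NET-specific, but the validate-and-report-in-context pattern the answer describes is language-agnostic. A minimal sketch in Python, with hypothetical column names and rules:

import csv

def validate_rows(path):
    # Collect per-row errors so they can be shown next to the row.
    errors = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for lineno, row in enumerate(reader, start=2):  # line 1 is the header
            if "@" not in (row.get("Email") or ""):
                errors.append((lineno, row, "invalid e-mail address"))
            try:
                float(row.get("Amount") or "")
            except ValueError:
                errors.append((lineno, row, "Amount is not a number"))
    return errors

# Display each failing row alongside its messages, as described above.
for lineno, row, message in validate_rows("upload.csv"):
    print(f"row {lineno}: {message}: {row}")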
Is there a standard or open format which can be used to describe the formatting of a flat file? My company integrates many different customer file formats. With an XML file it's easy to get or create an XSD to describe the XML file format. I'm looking for something similar to describe a flat-file format (fixed width, delimited, etc.). Stylus Studio uses a proprietary .conv format to do this; that .conv format can be used at runtime to transform an arbitrary flat file to an XML file. I was just wondering if there was a more open or standards-based method for doing the same thing.
I'm looking for one method of describing a variety of flat file formats whether they are fixed width or delimited, so CSV is not an answer to this question.
XFlat:
http://www.infoloom.com/gcaconfs/WEB/philadelphia99/lyons.HTM#N29
http://www.unidex.com/overview.htm
For complex cases (e.g. log files) you may consider a lexical parser.
About selecting existing flat-file formats: there is the comma-separated values (CSV) format or, more generally, DSV. But these are not "fixed-width", since a delimiter character (such as a comma) separates individual cells. Note that although CSV is standardized, not everybody adheres to the standard. Also, CSV may be too simple for your purposes, since it doesn't allow a rich document structure.
In that respect, the standardized and only slightly more complex (but thus more useful) formats JSON and YAML are a better choice. Both are supported out of the box by plenty of languages.
Your best bet is to have a look at all languages listed as non-binary in this overview and then determine which works best for you.
About describing flat file formats: This could be very easy or difficult, depending on the format. Though in most cases easier solutions exist, one way that will work in general is to view the file format as a formal grammar, and write a lexer/parser for it. But I admit, that's quite† heavy machinery.
If you're lucky, a couple of advanced regular expressions may do the trick. Most formats will not lend themselves to that, however.‡ If you plan on writing a lexer/parser yourself, I can recommend PLY (Python Lex-Yacc). But many other solutions exist, in many different languages, a lot of them more convenient than old-school Lex & Yacc. For more, see What parser generator do you recommend?
†: Yes, that may be an understatement.
‡: Even properly describing the email address format is not trivial.
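Since PLY is mentioned above, here is a minimal sketch of a PLY lexer for a toy semicolon-delimited format (the token set is illustrative, nowhere near a full grammar):

import ply.lex as lex

tokens = ("FIELD", "DELIM", "NEWLINE")

t_FIELD = r"[^;\n]+"
t_DELIM = r";"

def t_NEWLINE(t):
    r"\n+"
    t.lexer.lineno += len(t.value)
    return t

def t_error(t):
    print(f"Illegal character {t.value[0]!r} at line {t.lexer.lineno}")
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("Fresher;user1;This user joined as fresher\n")
for tok in lexer:
    print(tok.type, tok.value)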
COBOL (whether you like it or not) has a standard format for describing fixed-width record formats in files.
Other file formats, however, are somewhat simpler to describe. A CSV file, for example, is just a list of strings. Often the first row of a CSV file is the column names -- that's the description.
There are examples of using JSON to formulate metadata for text files. This can be applied to JSON files, CSV files and fixed-format files.
Look at http://www.projectzero.org/sMash/1.1.x/docs/zero.devguide.doc/zero.resource/declaration.html
This is IBM's sMash (Project Zero) using JSON to encode metadata. You can easily apply this to flat files.
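As a minimal sketch of the idea, assuming a hypothetical JSON metadata layout (the field names and widths are invented for the example):

import json

# Hypothetical metadata describing a fixed-width record layout.
layout = json.loads("""
{
  "fields": [
    {"name": "first_name", "width": 10},
    {"name": "last_name",  "width": 10},
    {"name": "amount",     "width": 8}
  ]
}
""")

def parse_record(line, layout):
    # Slice a fixed-width line into named fields per the metadata.
    record, offset = {}, 0
    for field in layout["fields"]:
        record[field["name"]] = line[offset:offset + field["width"]].strip()
        offset += field["width"]
    return record

print(parse_record("John      Smith       123.45", layout))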
At the end of the day, you will probably have to define your own file standard that caters specifically to your storage needs. What I suggest is using XML, YAML, or JSON as your internal container for all the file types you receive. On top of this, you will have to implement some extra validation logic to maintain metadata such as the column sizes of the fixed-width files (for importing from and exporting to fixed width). Alternatively, you can store or link a set of metadata to each file you convert to the internal format.
There may be a standard out there, but it's too hard to create 'one size fits all' solutions for these problems. There are entity relationship management tools out there (Talend, others) that make creating these mappings easier, but you will still need to spend a lot of time maintaining file format definitions and rules.
As for enforcing column width, XML might be the best solution, as you can describe the formats using XML Schemas (with length restrictions). For YAML or JSON, you may have to write your own logic for this, although I'm sure someone has come up with a solution.
See XML vs comma delimited text files for further reference.
I don't know of any standard or open format to describe a flat-file format, but one industry has done this: banking. Financial institutions communicate using standardized messages over a dedicated network called SWIFT. SWIFT messages were originally positional (before SWIFTML, the XML-ified version). It's a somewhat obscure suggestion, but you could look at the SWIFT Formatting Guide; it may give you some ideas.
That said, check out Flatworm, a humble flat-file parser. I've used it to parse positional and CSV files and liked its XML descriptor format. It may be a better suggestion than SWIFT :)
CSV
CSV is a delimited data format that has fields/columns separated by the comma character and records/rows separated by newlines. Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes. However, if a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
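A quick way to see those quoting rules in action with Python's csv module:

import csv, io

buf = io.StringIO()
writer = csv.writer(buf)
# Fields containing a comma, a quote, or a newline must be quoted;
# an embedded quote is escaped by doubling it.
writer.writerow(["plain", "has, comma", 'has "quote"', "has\nnewline"])
print(buf.getvalue())
# plain,"has, comma","has ""quote""","has
# newline"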
The CSV entry on Wikipedia led me to a comparison of data serialization formats that is pretty much what you asked for.
The only similar thing I know of is Hachoir, which can currently parse 70 file formats:
http://bitbucket.org/haypo/hachoir/wiki/Home
I'm not sure if it really counts as a declarative language, since it's plugin parser based, but it seems to work, and is extensible, which may meet your needs just fine.
As an aside, there are interesting standardised, extensible flat-file FORMATS, such as IFF (Interchange File Format).
EDIT: I would really like to see some general discussion about the formats, their pros and cons!
EDIT2: The bounty didn't really help to create the needed discussion; there are a few interesting answers, but comprehensive coverage of the topic is still missing. Six people marked the question as a favourite, which shows me that there is interest in this discussion.
When deciding about internationalization the toughest part IMO is the choice of storage format.
For example the Zend PHP Framework offers the following adapters which cover pretty much all my options:
Array : no, hard to maintain
CSV : don't know, possible problems with encoding
Gettext : frequently used, poEdit for all platforms available BUT complicated
INI : don't know, possible problems with encoding
TBX : no clue
TMX : too much of a big thing? no editors freely available.
QT : not very widespread, no free tools
XLIFF : the coming standard? BUT no free tools available.
XMLTM : no, not what I need
Basically I'm stuck with the four 'bold' choices. I would like to use INI files, but I keep reading about encoding problems... is it really a problem if I use strict UTF-8 (files, connections, DB, etc.)?
I'm on Windows, and I tried to figure out how poEdit works but just didn't manage; there are no tutorials on the web either. Is gettext still a viable choice, or an endangered species anyway?
What about XLIFF, has anybody worked with it? Any tips on what tools to use?
Any ideas for Eclipse integration of any of these technologies?
POEdit isn't really hard to get the hang of. Just create a new .po file, then tell it to import strings from your source files. The program scans your PHP files for any function calls matching _("Text"), gettext("Text"), etc. You can even specify your own functions to look for.
You then enter a translation in the appropriate box. When you save your .po file, a .mo file is automatically generated. That's just a binary version of the translations that gettext can easily parse.
In your PHP script make a call to bindtextdomain() telling it where your .mo file is located. Now any strings passed to gettext (or the underscore function) will be translated.
It makes it really easy to keep your translation files up to date. POEdit also has some neat features like allowing comments, showing changed and dropped strings and allowing fuzzy matches, which means you don't have to re-translate strings that have been slightly modified.
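The same workflow looks like this in Python, as a minimal sketch (the domain and directory names are assumptions):

import gettext

# Assumes POEdit produced locale/de/LC_MESSAGES/messages.mo
t = gettext.translation("messages", localedir="locale", languages=["de"])
_ = t.gettext

print(_("Please enter your login and password below"))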
There is always the Translate Toolkit, which allows converting between (I think) all the formats mentioned, with gettext (PO) and XLIFF preferred.
You can use INI if you want. It's just that INI has no way of telling anyone that it is in UTF-8, so if someone opens your INI file with an editor, they might corrupt your file.
So it only works if you can trust everyone to edit it with UTF-8 encoding.
You can add a BOM at the start of the file; some editors know about it.
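In Python, for example, a BOM-prefixed INI file can be read safely with the utf-8-sig codec (the file name is illustrative):

import configparser

config = configparser.ConfigParser()
# 'utf-8-sig' transparently strips the BOM if one is present.
with open("messages.ini", encoding="utf-8-sig") as f:
    config.read_file(f)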
What do you want to store? User-generated content or your application resources?
I worked with two of these formats on the l10n side: TMX and XLIFF. They are pretty similar. TMX is more popular nowadays, but XLIFF is gaining support quickly. There was at least one free XLIFF editor when I last looked into it, Transolution, but it is no longer being developed.
I do the data storage myself using a custom design - All displayed text is stored in the DB.
I have two tables.
The first table has an identity value, a 32-character varchar field (indexed), and a 200-character English description of the phrase.
My second table has the identity value from the first table, a language code (EN_UK, EN_US, etc.), and an NVARCHAR column for the text.
I use an nvarchar for the text because it supports other character sets which I don't yet use.
The 32 character varchar in the first table stores something like 'pleaselogin' while the second table actually stores the full "Please enter your login and password below".
I have created a huge list of dynamic values which I replace at runtime. An example would be "You have {[dynamic:passworddaysremain]} days to change your password." - this allows me to work around the word ordering in different languages.
I have only had to deal with Arabic numerals so far, but I will have to work something out for the first user who requires non-Arabic numbers.
I pull this information out of the database every two hours and cache it to disk, one XML file per language, making extensive use of CDATA sections.
There are many options available; for performance, you could use HTML templates for each language. My method works well but does use the XML DOM a lot at runtime to create the pages.
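A minimal sketch of that dynamic-value replacement in Python (the token syntax is taken from the example above; the lookup table is hypothetical):

import re

def render(template, values):
    # Replace {[dynamic:key]} tokens with their runtime values.
    return re.sub(
        r"\{\[dynamic:(\w+)\]\}",
        lambda m: str(values[m.group(1)]),
        template,
    )

msg = "You have {[dynamic:passworddaysremain]} days to change your password."
print(render(msg, {"passworddaysremain": 7}))
# You have 7 days to change your password.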
One rather simple approach is to just use a resource file and resource script. Programs like MSVC have no problem editing them. They're also reasonably friendly to other systems (and to text editors) as well. You can just create separate string tables (and bitmap tables) for each language, and mark each such table with what language it is in.
None of those choices looks very appetizing to me.
If you're sending files out for translation in multiple languages, you want to be able to trust that the encodings are correct, especially if no one on your team speaks those languages. It is sometimes difficult to spot an encoding problem in a foreign language, and it is just too easy to inadvertently corrupt file encodings if you let your OS 'guess'.
You really want a format that declares its encoding. Otherwise, translators or their translation tools might select something other than UTF-8. For my money, any kind of simple XML format is best, but it looks like you'd need to roll your own in Zend. XLIFF and TMX are certainly overkill.
A format like Java's XML resources would be ideal.
This might be a little different from what's been posted so far and may not be exactly what you're looking for, but I thought I would add it, if for nothing else but a different approach. I went with an object-oriented approach: a system that encapsulates language files into a class by storing them as an array of string=>translation pairs. Access to a translation is through a method called translate, with the key string as a parameter. Extending classes inherit the parent's language array and can add to it or override it. Because the classes are extensible, you can change a base class and have the changes propagate through the children, making it more maintainable than an array by itself. Plus, you only load the classes you need.
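A minimal sketch of that object-oriented approach in Python (the class and key names are invented for illustration):

class BaseLanguage:
    # Base translations; child classes inherit and may override entries.
    strings = {"btn_save": "Save", "greeting": "Hello"}

    @classmethod
    def translate(cls, key):
        # Walk the class hierarchy so child overrides win
        # but parents fill in the gaps.
        for klass in cls.__mro__:
            table = getattr(klass, "strings", {})
            if key in table:
                return table[key]
        return key  # fall back to the key itself

class German(BaseLanguage):
    strings = {"btn_save": "Speichern"}

print(German.translate("btn_save"))  # Speichern (overridden)
print(German.translate("greeting"))  # Hello (inherited from the base)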
We just store the strings in the DB and have a translator mode built into the application to handle actually adding strings for different languages.
In the application we use various tricks to create text ids, like
£("btn_save")
£(Order.class,"amt")
The translations are loaded from the DB when the system boots, or when a reload is manually triggered. The £ method takes care of looking up the translated string according to the language specified in the user session.
You can check out my l10n tool, iL10Nz, at http://www.myl10n.net
You can upload PO/POT, XLIFF, and INI files, translate them, and download the results.
You can also check out this video on YouTube:
http://www.youtube.com/watch?v=LJLmxMFxaxA
Thanks,
Olivier