Mozilla have a pretty nice universal character set detector built into their products. It’s modular, it’s quick, and it has a great deal of real-world research and testing behind it. I wanted to use it as part of a project I am working on, but couldn’t find a nice standalone command-line version. There is a Java port, but the overhead of loading up a JVM just to detect the character set of a document was unappealing, and porting the entire codebase to another language would take too long (and the result would probably run a lot slower). So I spent an evening learning some C/C++ and came up with just what I needed. I thought it might be useful to someone else, too, so I am releasing it here.
The README.txt contains compilation and usage instructions. I have no more words now. Get it below!
Comment #29
Colin,
I've used some of your code as the basis for a library that exports a C interface to the universal character set detector (UCSD). I've also managed to remove the dependency on NSPR, as it's only used in trivial ways by the Detector. (This is important for me as I don't feel up to cross-compiling all of NSPR for the iPhone.)
You can find the library at http://github.com/batterseapower/libcharsetdetect. In the README I've shown how you can build your universalchardet executable as a client of the library, to help folks get started with the API.
Thanks for publishing your code!
Max