You may not yet be convinced that you should abandon all the legacy encodings used for text in the different languages around the world and switch to Unicode instead. If so, please first read this excellent article from Joel on Software, which explains why. Don’t worry, I’ll be waiting here patiently for you to come back and read how to achieve the transition in your Eclipse Java projects.
Changing to Unicode? Yes, we can!
When first reading about Unicode, codepoints, character sets, character encodings and byte order marks, you might feel overwhelmed and start wondering whether this thing is even worth your while and how difficult it will be for you to convert your projects to use it. Are you really going to sell your software in Asia? Maybe you can do without it after all?
Don’t be worried. In fact, you can immediately forget about almost everything you read, especially when you work in Java, which is already using Unicode internally. It is not that difficult at all. I will go one step further and proclaim that the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things. Stick to Unicode everywhere and you won’t have any problems with character encodings ever again. Promised!
But what about all those other encodings?
They are a relic from when Unicode was still being developed and people still had to find the best way to encode text when they had to break out of their 1 byte = 1 character models. Or they are for very specific purposes that are out of scope for this article. Basically only one of those encodings matters to you today and that is UTF-8. And you won’t need to know anything more about it than how to specify that that is the encoding you will be using everywhere.
Java uses UTF-16 for its Strings in memory, and C/C++ programs often use UTF-16 or UTF-32 wide characters depending on the platform, but those are in-memory details. Unless you are planning to poke around the internals of Strings in memory, you will never need to know anything about them. UTF-8 has emerged as the standard encoding for text in files and network traffic, and even for that you really don’t need any deep knowledge. All decent software can read UTF-8 these days, so there really is no reason to try to decode it yourself. The only thing you need to know is how to tell all those programs and components that UTF-8 is what you are using.
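In Java code, "telling the program" mostly comes down to naming the charset whenever text crosses the boundary between Strings and bytes. Here is a minimal sketch of that idea. The class and method names are made up for this example; only `StandardCharsets.UTF_8`, `String.getBytes(Charset)` and the `String(byte[], Charset)` constructor come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTrip {
    // Encode a String to bytes, naming UTF-8 explicitly instead of
    // relying on whatever the platform default happens to be.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes back into a String, again naming UTF-8 explicitly.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The accented letters are written as Unicode escapes so that the
        // encoding of this source file itself cannot influence the result.
        String original = "na\u00EFve caf\u00E9";
        byte[] bytes = encode(original);

        // The two accented letters take two bytes each in UTF-8,
        // so the byte count is larger than the character count.
        System.out.println(original.length() + " chars, " + bytes.length + " bytes");
        System.out.println("round trip ok: " + decode(bytes).equals(original));
    }
}
```

The same rule applies to readers, writers and streams: whenever an API offers an overload that takes a `Charset`, pass UTF-8 rather than leaving it out.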
Ok, so let’s get started!
The first thing to set up is the default encoding in your editor/IDE, so any files you create will be created in Unicode/UTF-8. Since this article is about Java projects in the Eclipse IDE, that is what I’ll be describing here. Don’t worry if you are using anything else; it will almost certainly support UTF-8 just as well. If it doesn’t, you should ditch it and switch.
Making sure your files are created as UTF-8
In Eclipse the encoding for a file can be specified directly on the file (which is a bit cumbersome), on its container (the project the file is in), or on the project’s container (the workspace).
Change your workspace settings
To change the workspace setting to UTF-8 in Eclipse, so that any new projects you create will use it by default, open the workspace Preferences dialog:
<main menu> -> Window -> Preferences
In the preferences dialog, change the encoding for everything in sight to UTF-8. The easiest way is to type ‘encoding’ in the filter box to get all related settings. They are at:
General -> Content Types
This has a list of file types. Click through them and make sure that the encoding for each is either not set or set to UTF-8.
General -> Workspace
Set Text file encoding to Other: UTF-8.
Unfortunately the Eclipse people insist on using Default (platform specific) here as the default setting. There have been large debates about this on their bug tracker, but some of the key people seem to think that defaulting to Windows/Unix/Mac specific encodings here is wiser than just setting it to UTF-8 for the whole world. A huge mistake imho, but they are not easily convinced. Of course everyone can change this, but most people are unaware of this setting, so in practice most projects have their default encoding set to Windows-specific cp1252… (or Linux/Mac specific settings). Very bad imho, but there it is. When I install Eclipse or set up a new workspace, this is the first thing I change. You should try to make a habit out of that too. Best to go for Unicode/UTF-8 from the beginning so you don’t forget about it and get hurt by it later.
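To give an idea of what "getting hurt" looks like, here is a small sketch of text being written in one encoding and read back in another. The class and method names are invented for this example, and I use ISO-8859-1 as a stand-in for cp1252 because every Java runtime is guaranteed to ship it; for the accented letter used here the two behave the same.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text with one charset and decode the resulting bytes with
    // another, which is what happens when a file saved by one tool is
    // opened by a tool that assumes a different encoding.
    static String misread(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // What this JVM falls back to when no charset is given.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // "cafe" with an acute accent on the e, saved as UTF-8 but read
        // as ISO-8859-1: the single accented letter comes back as two
        // unrelated characters.
        String garbled = misread("caf\u00E9",
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("characters after misreading: " + garbled.length());
        System.out.println(garbled);
    }
}
```

How the garbled text looks on screen depends on your console, which is exactly the kind of surprise that consistent UTF-8 settings are meant to prevent.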
Web -> CSS Files, HTML Files, JSP Files and XML -> XML Files
Set encoding to ISO 10646/Unicode(UTF-8)
A bit of a confusing name, but UTF-8 is actually also an ISO standard.
Change your project settings
When this is done, check your project settings.
Select your project and open the Project Properties:
<main menu> -> Project -> Properties
If you will be sharing your project with others, you run the risk that they have forgotten to set their workspace settings correctly and will introduce wrongly encoded files into your project when they create new files. So to be safe, it might be smart to go to the Resource section of the project properties and set the text file encoding to “Other: UTF-8” instead of “Inherited from container”, because these project settings will be shared along with the project and the workspace settings are not.
Wrapping things up
If you already have existing files, it might be necessary to open them and save them again. You can check the encoding of an individual file by right-clicking it and selecting Properties from the context menu. There, in the Resource section, you can check and set its encoding. It’s OK for individual files to have it set to “Inherited from container”.
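If you have many existing files, clicking through them one by one gets old quickly. As a rough sketch (the class and method names are my own, not part of Eclipse), a strict UTF-8 decoder from the JDK can flag files whose bytes are not valid UTF-8 at all:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class Utf8Check {
    // Returns true when the bytes decode cleanly as UTF-8. The decoder is
    // configured to REPORT problems instead of silently replacing bad
    // input, so malformed sequences raise an exception.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Usage: java Utf8Check file1 file2 ...
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] content = Files.readAllBytes(Paths.get(name));
            System.out.println(name + ": "
                    + (isValidUtf8(content) ? "valid UTF-8" : "NOT valid UTF-8"));
        }
    }
}
```

Keep in mind that this only catches files that cannot be UTF-8. A pure ASCII file always passes, and a legacy-encoded file can occasionally pass by accident, so treat a "valid" result as a hint rather than proof.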
Last, but not least, check any XML, HTML and JSP files in your project.
XML files should either have no XML declaration, a declaration without an encoding (XML parsers then assume UTF-8 unless a byte order mark says otherwise), or this declaration:
<?xml version="1.0" encoding="UTF-8"?>
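To see that a UTF-8 file parses the same with or without that line, here is a small sketch using the XML parser that ships with the JDK. The class, the method names and the sample `<greeting>` element are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlEncodingDemo {
    // Turn the XML text into UTF-8 bytes, as if it had been saved in a
    // UTF-8 file, and let the parser work out the encoding by itself.
    static Document parseUtf8(String xml) {
        try {
            byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf8));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    // Text content of the root element after parsing.
    static String rootText(String xml) {
        return parseUtf8(xml).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        String body = "<greeting>caf\u00E9</greeting>";
        String withDeclaration = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body;

        // Both variants should yield the same text for the root element.
        System.out.println(rootText(withDeclaration).equals(rootText(body)));

        // The encoding named in the declaration, as seen by the parser.
        System.out.println(parseUtf8(withDeclaration).getXmlEncoding());
    }
}
```

The trouble only starts when the declaration names one encoding and the file is actually saved in another, which is why it pays to check both.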
HTML files should either not define a meta tag for Content-Type, or have this one:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
JSP files should start with this:
<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>
...to instruct the server to emit the correct Content-Type HTTP header. The pageEncoding attribute tells the JSP compiler how the JSP file itself is encoded.
Phew! That was a lot of work. But it’s worth it!
UPDATE: I found this blog post by Timmy Jose which also discusses UTF-8 in Eclipse: Changing the encoding in Eclipse to UTF-8 – howto
(I love Unicode image stolen from Arjen Poutsma’s blog post On bytes, chars, Strings, XML and Unicode which discusses Unicode in Java in more detail)