Unicode/UTF-8 in your Eclipse Java projects

[Image: I love Unicode, with the heart symbol ironically replaced with the broken character (question mark) symbol]

ASCII is dead! Long live Unicode!

You may not yet be convinced that you should abandon all the legacy encodings used for text in the different languages around the world and switch to Unicode instead. If so, please first read this excellent article from Joel on Software, which explains why. Don’t worry, I’ll be waiting here patiently for you to come back and read how to achieve the transition in your Eclipse Java projects.

Joel on software: Unicode and Character sets

Changing to Unicode? Yes, we can!

When first reading about Unicode, codepoints, character sets, character encodings and byte order marks, you might feel overwhelmed and start wondering whether this thing is even worth your while and how difficult it will be for you to convert your projects to use it. Are you really going to sell your software in Asia? Maybe you can do without it after all?

Don’t be worried. In fact, you can immediately forget about almost everything you read, especially when you work in Java, which is already using Unicode internally. It is not that difficult at all. I will go one step further and proclaim that the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things. Stick to Unicode everywhere and you won’t have any problems with character encodings ever again. Promised!

But what about all those other encodings?

Forget them.

They are a relic from when Unicode was still being developed and people still had to find the best way to encode text when they had to break out of their 1 byte = 1 character models. Or they are for very specific purposes that are out of scope for this article. Basically only one of those encodings matters to you today and that is UTF-8. And you won’t need to know anything more about it than how to specify that that is the encoding you will be using everywhere.

Java strings use UTF-16 internally, and so do the Windows APIs that many C/C++ programs call (older systems used its fixed-width predecessor UCS-2), but that’s only because those map easiest in memory. Unless you are planning to poke around the internals of Strings in-memory, you will never need to know anything about them. UTF-8 has emerged as the standard encoding for text in files and network traffic, and even for that you really don’t need any deep knowledge. All decent software can read UTF-8 these days, so there really is no reason to try to decode it yourself. The only thing you need to know is how to tell all those programs and components that UTF-8 is what you are using.
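To make the difference between the in-memory form and the on-disk form a bit more tangible, here is a minimal sketch. The class name `Utf16VersusUtf8` and its helper methods are made up for this illustration; the calls they use (`String.length`, `String.codePointCount`, `String.getBytes`) are standard Java.

```java
import java.nio.charset.StandardCharsets;

final class Utf16VersusUtf8 {

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int utf16Units(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the text occupies once it is encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII, so it compiles
        // the same way whatever encoding the compiler assumes.
        String[] samples = { "A", "\u00e9", "\u20ac", "\ud83d\ude00" };
        for (String s : samples) {
            System.out.printf("U+%04X: %d UTF-16 unit(s), %d code point(s), %d UTF-8 byte(s)%n",
                    s.codePointAt(0), utf16Units(s), codePoints(s), utf8Bytes(s));
        }
    }
}
```

The same character takes a different number of units in memory than bytes in a file, which is exactly why you only need to care about the encoding at the boundary where text is read or written.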

Ok, so let’s get started!

The first thing to set up is the default encoding in your editor/IDE, so any files you create will be created in Unicode/UTF-8. Since this article is about Java projects in the Eclipse IDE, that is what I’ll be describing here. Don’t worry if you are using anything else; it will almost certainly support UTF-8 just as well. If it doesn’t, you should ditch it and switch.

Making sure your files are created as UTF-8

In Eclipse the encoding for a file can be specified directly on the file (which is a bit cumbersome), on its container (the project the file is in), or on the project’s container (the workspace).

Change your workspace settings

To change the workspace setting to UTF-8 in Eclipse, so that any new projects you create will use it by default, open the Workspace Preferences dialog:

<main menu> -> Window -> Preferences

In the preferences dialog, change the encoding for everything in sight to UTF-8. The easiest way is to type ‘encoding’ in the filter box to get all related settings. They are at:

General -> Content Types

This page has a list of file types. Click through them and make sure that the encoding for each of them is either not set, or set to UTF-8.

Screenshot of the Encoding -> Content Types section of the Workspace Preferences dialog in Eclipse IDE

Workspace Preferences -> Encoding -> Content Types

General -> Workspace

Set Text file encoding to Other: UTF-8.

Screenshot of the Encoding -> Workspace section of the Workspace Preferences dialog in Eclipse IDE

Workspace Preferences -> Encoding -> Workspace

Unfortunately the Eclipse people insist on using Default (platform specific) here as the default setting. There have been large debates about this on their bug tracker, but some of the key people seem to think that defaulting to Windows/Unix/Mac specific encodings here is wiser than just setting it to UTF-8 for the whole world. A huge mistake imho, but they are not easily convinced. Of course everyone can change this, but most people are unaware of this setting, so in practice most projects have their default encoding set to the Windows-specific cp1252 (or to Linux/Mac specific settings). Very bad imho, but there it is. When I install Eclipse or set up a new workspace, this is the first thing I change. You should try to make a habit out of that too. Best to go for Unicode/UTF-8 from the beginning so you don’t forget about it and get hurt by it later.
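The same "platform specific default" trap exists in your own Java code: any API that reads or writes text without an explicit charset falls back to the JVM default, which on older JDKs follows the operating system (since JDK 18 the default is UTF-8). A small sketch of the safer habit, with a made-up class name `ExplicitCharset`, could look like this:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharset {

    /** Writes text as UTF-8, regardless of the platform default. */
    static void writeUtf8(Path file, String text) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    /** Reads a whole file, decoding its bytes as UTF-8. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // What the JVM would use if no charset were given explicitly.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        Path file = Files.createTempFile("utf8-demo", ".txt");
        writeUtf8(file, "Verificaci\u00f3n");
        // How this prints depends on the console's own encoding.
        System.out.println(readUtf8(file));
        Files.delete(file);
    }
}
```

Passing `StandardCharsets.UTF_8` everywhere is a bit more typing, but it means the program behaves the same on every machine.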

Web -> CSS Files, HTML Files, JSP Files and XML -> XML Files

Set encoding to ISO 10646/Unicode(UTF-8)

A bit of a confusing name, but ISO/IEC 10646 is the ISO standard that defines the same character set as Unicode, and UTF-8 is one of its encodings.

Screenshot of the Web/XML -> Files section of the Workspace Preferences dialog in Eclipse IDE

Preferences -> Web/Xml -> Files

Change your project settings

When this is done, check your project settings.

Select your project and open the Project Properties:

<main menu> -> Project -> Properties

If you will be sharing your project with others, you run the risk of them having forgotten to set their workspace settings correctly and introducing wrongly encoded files into your project when they create new files. So to be safe, it might be smart to set your project settings to “Other: UTF-8” instead of “Inherited from container”, because these project settings will be shared along with the project and the workspace settings are not.

Screenshot of the Resource section of the Project Properties dialog in Eclipse IDE

Project Properties -> Resource
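Eclipse stores this project-level choice in the file `.settings/org.eclipse.core.resources.prefs` inside the project, under the key `encoding/<project>`, which is why it travels with the project when you commit that file. If you want to verify it automatically, a check along these lines might do. This is only a sketch: the class name `ProjectEncodingCheck` is hypothetical, and the sample content is what such a file typically looks like, not copied from a specific project.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class ProjectEncodingCheck {

    /**
     * Returns the project-wide encoding declared in the given prefs content,
     * or null when the project inherits the encoding from the workspace.
     */
    static String projectEncoding(String prefsContent) throws IOException {
        Properties prefs = new Properties();
        prefs.load(new StringReader(prefsContent));
        return prefs.getProperty("encoding/<project>");
    }

    public static void main(String[] args) throws IOException {
        // Typical content of .settings/org.eclipse.core.resources.prefs
        // after choosing "Other: UTF-8" in the project properties.
        String sample = "eclipse.preferences.version=1\n"
                      + "encoding/<project>=UTF-8\n";
        System.out.println("Project encoding: " + projectEncoding(sample));
    }
}
```

In a real check you would read the file from the project directory instead of using the inline sample.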

Wrapping things up

If you have existing files already, it might be necessary to open them and save them again. You can check the encoding of an individual file by right-clicking it and selecting Properties from the context menu. There, in section Resource, you can check and set its encoding. It’s ok for individual files to have it set to “Inherited from container”.
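If you have more than a handful of existing files in a legacy encoding, re-saving them by hand gets old quickly. Below is a minimal sketch of a converter, assuming you already know the encoding the file is currently in (it cannot be detected reliably). The class name `ConvertToUtf8` and its command line are invented for this example.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ConvertToUtf8 {

    /** Re-encodes one file in place from the given source charset to UTF-8. */
    static void convert(Path file, Charset from) throws IOException {
        byte[] raw = Files.readAllBytes(file);
        // Decode with the encoding the file is really in...
        String text = new String(raw, from);
        // ...and write the same characters back out as UTF-8.
        Files.write(file, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ConvertToUtf8 <file> <source-charset>");
            return;
        }
        convert(Paths.get(args[0]), Charset.forName(args[1]));
    }
}
```

Run it only once per file and keep a backup (or a clean commit) first: converting a file that is already UTF-8 as if it were cp1252 will garble every non-ASCII character in it.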

Last, but not least, check any XML, HTML and JSP files in your project.

XML files should either have no XML declaration, no encoding in their declaration, or this declaration:

<?xml version="1.0" encoding="UTF-8"?>
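The declaration only helps when it matches the bytes that are really in the file. As a quick sanity check, here is a small sketch that feeds UTF-8 bytes with a UTF-8 declaration to the JDK’s built-in XML parser and reads the text back. The class name `XmlEncodingDemo` and the `<word>` element are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class XmlEncodingDemo {

    /** Parses raw XML bytes and returns the text content of the root element. */
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                   + "<word>Verificaci\u00f3n</word>";
        // The bytes are produced as UTF-8, matching the declaration above.
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(rootText(bytes).equals("Verificaci\u00f3n"));
    }
}
```

The parser works from the bytes and the declaration alone, so the platform default encoding does not come into play here.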

HTML files should either not define a meta tag for Content-Type, or have this one:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

JSP files should start with this:

<%@ page language="java" contentType="text/html; charset=UTF-8" pageEncoding="UTF-8"%>

...to instruct the server to emit the correct Content-Type HTTP header (contentType) and to tell the JSP compiler how the file itself is encoded (pageEncoding).

Phew! That was a lot of work. But it’s worth it!

UPDATE: I found this blog post by Timmy Jose which also discusses UTF-8 in Eclipse: Changing the encoding in Eclipse to UTF-8 – howto

(I love Unicode image stolen from Arjen Poutsma’s blog post On bytes, chars, Strings, XML and Unicode which discusses Unicode in Java in more detail)

28 responses to “Unicode/UTF-8 in your Eclipse Java projects”

  1. Hi,
    This did not work in my case. I actually want to set the Unicode value in a text box at run time.
    I don’t know where I went wrong. I am using Twist, which uses an editor similar to Eclipse.
    I applied the settings above, but now the editor does not even show me the Unicode characters.

    Thank you,

  2. Hi,
    The issue got solved after reopening the test.
    sorry as I missed that part in your post.
    Thanks

  3. I am using Eclipse Juno so some of the screens are a little different but the solution solved the issue I was having with ? being displayed when displaying unicode characters. Thanks for that

  4. It is not working in my Eclipse. I have to compile a Java file in UTF-8. The current encoding is ANSI. Can you please help me?

  5. I have tried the whole process. Nothing is working. The .class file comes out in ANSI encoding every time. Can you please solve the issue?

  6. it did not work for me, i did all the steps and when i write the word ‘Verificación’ in the eclipse editor, the editor put this :Verificaci�n .
    what could it be?

  7. The settings described in this post mostly apply to new files and new projects. You will have to set the encoding of existing stuff manually.

    Check the properties of the file that has the wrong text in it. Make sure encoding is set to UTF-8. Open the file in an external editor to verify (I recommend Notepad++).

  8. thanks! , i changed to UTF-8 in the properties , then i opened in an external editor and copied back to eclipse , and it work.

  9. HI, checked out a project , but it has thousands of this simbol � that are accents like ó for example , is there a fast way to convert the whole project ?, i dont want to go file by file , it a lot of files

  10. Why don’t you write a small Java program to do it? Figure out the current encoding of the files and pass that encoding to an input stream reader. Read in each file in its current encoding, then save it with “UTF-8” encoding.

    But I’d do a google search first, because it seems a common problem, so maybe someone already wrote such a program. 😉

  11. I followed your steps (using Eclipse Java EE IDE for Web Developers.Version: Luna Release (4.4.0) Build id: 20140612-0600) and it didn’t work however, the project jsp’s work right out of the box using: Eclipse IDE for Java Developers Version: Luna Service Release 2 (4.4.2) Build id: 20150219-0600. Please help. What is it about the EE edition and not the regular edition.
    Thanks
    Patrick

  12. Excellent post, works for me, and I agree totally that there is no excuse using anything other than UTF-8. However, your statement about all software supporting it is not true. Microsoft Excel (at least Mac 2011) does not support UTF-8 — a right pain.
    David

  13. Hi David, I never used Excel on the Mac, but in the Windows version it is supported AFAIK. Anyways sometimes you have to overstate things a little to make a point 😉
