Perforce Public Knowledge Base - Internationalization and Localization
 
Problem

This article explains how to configure a Perforce Server to run in internationalization mode and how to configure Perforce clients to work with different character sets. It also discusses problems you might encounter when handling Unicode or non-ASCII data in Perforce, along with remedies for them.


Solution
In Perforce there are several ways to work with multiple character sets depending on your requirements:

  • With the addition of "utf16" as another standard Perforce filetype (see p4 help filetypes for details), it is possible to continue running your Perforce server in "non-unicode" mode AND safely manage your unicode files.

  • If your filenames or Perforce metadata contain non-ASCII characters, then your Perforce administrator might need to consider switching your Perforce Server into unicode mode as described below. When running in unicode mode, all non-file data (identifiers, descriptions, and so on), as well as the content of all files of type "unicode", are translated between the character set specified by the P4CHARSET variable on the client and UTF8 in the server.

    Before switching to unicode mode, verify that the character set you want to work with is supported. If the goal is to manage files that contain unicode characters, consider standardizing on either UTF8 or UTF16 encoding. Note that starting with the 2007.2 release, Perforce adds a new UTF16 filetype (see the Release Notes) to specifically support UTF16 files in both non-unicode and unicode modes. To benefit from UTF16 support, all of your Perforce users need to be running 2007.2 versions of Perforce client programs.

    If you need to work on unicode files that contain characters saved in the users directory, syncing/submitting such files to/from a single client machine can become a cumbersome process, because extra steps (switching between different P4CHARSET settings, installing additional code pages, and so on) are required to complete the task.

  • If the above option is not appropriate for your situation, then the unicode files can be added as binary files. This does make diffing such files more difficult, because by default Perforce does not support diffing true binary files. However, if your binary files are true UTF8 files, then the default diff/merge tool in P4V correctly diffs them. In addition, P4Win/P4V users can also specify a third-party diff/merge tool for such files. Likewise, command line users can force the diff using the "-t" flag.

Switching the Perforce server into unicode mode

Before you use Perforce in a unicode environment, you must first instruct your Perforce Server to run in unicode mode. To set up your server to run in this mode, stop the server, and then run this command from within your Perforce server root directory:

p4d -xi

This command verifies that all existing metadata is valid UTF8 and sets a protected unicode counter, which ensures that future invocations of p4d operate in unicode mode. Once set on the server, unicode mode cannot be deactivated (that is, you cannot return to non-unicode mode). After p4d -xi switches your server into unicode mode, you can invoke p4d with your usual flags.

Important:

If you try to switch the server to unicode mode with the p4d -xi command, the server might respond with "invalid UTF8" messages such as:

Table db.user has 14 rows with invalid UTF8.

Table db.domain has 1 rows with invalid UTF8.
...

Perforce server error:
Database has 14 tables with non-UTF8 text and can't be switched to Unicode mode.

Take special note of the table names with invalid UTF8: if any of the db.rev* or db.working tables is listed, you might have a file name with invalid UTF8 whose archive file or directory will need to be renamed.

To fix this problem, do the following:

  1. Stop the server to prevent updates during this process.
  2. Take a checkpoint.
  3. Convert the checkpoint file to UTF8 encoding.
    Summary: use any editor or process to remove non-UTF8 byte sequences or convert them to valid UTF8 byte sequences.
    On Unix, consider using iconv. On Windows, a version of iconv is available at: http://gnuwin32.sourceforge.net/packages/libiconv.htm
    Windows users can also use an editor of their choice (e.g., notepad2) that can save in UTF8 encoding. The editor itself is not important, although word processors should be avoided because they may introduce additional formatting. Most Windows editors have file size limitations. Notepad saves with a BOM, which must be removed.
  4. Remove all db.* files.
  5. Restore from the UTF8 checkpoint file.
  6. Run p4 verify.
  7. Try p4d -xi again.
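
Before editing a checkpoint in step 3, it can help to know which lines actually contain invalid UTF8 and whether the file starts with a BOM. The following is a minimal Python sketch, not a Perforce tool: the function names and the file name in the usage comment are hypothetical, and it treats the checkpoint purely as lines of bytes without interpreting the checkpoint format.

```python
import codecs
from typing import Iterable, List


def find_invalid_utf8_lines(lines: Iterable[bytes]) -> List[int]:
    """Return the 1-based numbers of lines that are not valid UTF-8.

    Splitting on newlines is safe here because a UTF-8 multibyte
    sequence never contains the newline byte.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def has_utf8_bom(first_line: bytes) -> bool:
    """Report whether the data starts with the UTF-8 byte order mark."""
    return first_line.startswith(codecs.BOM_UTF8)


# Usage (hypothetical file name); a file opened in binary mode
# yields one bytes object per line:
#
# with open("checkpoint.42", "rb") as fh:
#     print(find_invalid_utf8_lines(fh))
```

The reported line numbers only tell you where to look; deciding what the bytes were meant to represent still requires knowing the encoding they were originally written in.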

To convert to legal UTF8, you can use any of the available character set conversion tools. The "iconv" converter is a good choice and is available for both Unix and Windows. Note that "iconv" might miss some German umlaut characters, so check its output carefully.
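
As an alternative to an external converter, the conversion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a supported Perforce procedure: the helper name is hypothetical, and the legacy encoding (Windows code page 1252 by default) is a guess that you must replace with whatever encoding your data was really written in.

```python
import codecs


def to_utf8(raw: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Return one line re-encoded as UTF-8 without a BOM.

    A line that is already valid UTF-8 is returned unchanged. Any other
    line is decoded as a whole with legacy_encoding (an assumption about
    the data), so a line mixing both encodings will be mangled. Bytes
    that are undefined in the legacy encoding raise UnicodeDecodeError
    instead of being silently dropped.
    """
    # A BOM is only expected at the very start of the file.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


# Usage (hypothetical file names), converting line by line:
#
# with open("checkpoint.42", "rb") as src, open("checkpoint.utf8", "wb") as dst:
#     for line in src:
#         dst.write(to_utf8(line))
```

Raising on undefined bytes is a deliberate choice here: a conversion that quietly discards characters is harder to notice than one that stops.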

Run p4 verify immediately if you had to convert your checkpoint using any method. If you had db.rev* tables with invalid UTF8, then p4 verify might show all revisions as MISSING! and the archive file or directory will need to be renamed.

User Notes

To use Perforce in a unicode environment, you must also set the P4CHARSET environment variable on your client machines. If it is not set, users of P4V or P4SCC.DLL are asked to choose their encoding when they first connect to a unicode enabled server, and other users receive a "Unicode server permits only unicode enabled clients" message.

Do not set P4CHARSET if your server is not unicode enabled.   If your command returns "Unicode clients require a unicode enabled server" then unset P4CHARSET or check that you are connecting to the expected Perforce Server.

Important:
Be aware that mixing different encodings and, consequently, P4CHARSET settings on the same computer is likely to cause file corruption and/or translation problems.

The following table lists a few of the most used (in the USA) P4CHARSET values:

Language            Platform    Windows code page  Unix locale  P4CHARSET setting
English/High-ASCII  Windows     1252               n/a          winansi
English/High-ASCII  UNIX/Linux  n/a                varies       iso8859-1/utf8
English/High-ASCII  MAC OS X    n/a                n/a          utf8
All/untranslated    All         n/a                n/a          utf8*
All                 All         n/a                n/a          utf16**

 

Also worth mentioning is the P4CHARSET value "none", which (a) overrides any existing P4CHARSET setting when passed with the "-C" switch, and (b) allows you to connect to either a non-unicode or a unicode enabled server. For the complete list of supported P4CHARSET values, run p4 help charset or visit: http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt

If you need a charset other than what we support, please contact Perforce Support regarding the character set encoding you would like supported.  Until we support your charset, you must work with your unicode text files in a currently supported charset.

* utf8 is untranslated, but the file content is validated.

** utf16 requires that P4COMMANDCHARSET be set to a different (non-utf16) charset
for the p4 command line client to function, for example:

p4 -C utf16 -Q utf8 sync some_files
where "-C" is a command line flag for P4CHARSET and "-Q" is for P4COMMANDCHARSET.

 

Note that both P4V and P4Win have a field in the Preferences dialog for resetting P4CHARSET.

Setting P4CHARSET on Windows:

  1. Log in to Windows and open an MS-DOS command prompt.
  2. Confirm that you have a True Type (TT) or Open Type font.
  3. Display your active code page on Windows machines by issuing the chcp command. Windows displays a message like the following:
    Active code page: 1252
  4. Select the character set based on the active code page as follows:

    Code page Set P4CHARSET to
    1252 winansi
    932 shiftjis


    To set P4CHARSET for all users on this workstation, you need Administrator privileges. Issue the following command:

    p4 set -s P4CHARSET=[character_set]
    

    If you do not have Administrator privileges, you can use:

    p4 set P4CHARSET=[character_set]
    

    to set P4CHARSET for the user currently logged in. Other users on the same machine have to set P4CHARSET independently.
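
The lookup in step 4 can be sketched as a small Python helper. This is only an illustration: the function name is hypothetical, the mapping contains just the two code pages listed above, and any other code page returns None so that you consult p4 help charset rather than rely on a guess.

```python
import re
from typing import Optional

# Only the code pages listed in this article; extend from "p4 help charset".
CODE_PAGE_TO_P4CHARSET = {
    1252: "winansi",
    932: "shiftjis",
}


def p4charset_for_chcp(chcp_output: str) -> Optional[str]:
    """Map the text printed by chcp to a P4CHARSET value, or None."""
    # The code page is the number at the end of the message.
    match = re.search(r"(\d+)\s*$", chcp_output.strip())
    if match is None:
        return None
    return CODE_PAGE_TO_P4CHARSET.get(int(match.group(1)))


print(p4charset_for_chcp("Active code page: 1252"))
```

The result is the value you would pass to p4 set P4CHARSET=[character_set].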

Setting P4CHARSET on UNIX:

Set P4CHARSET to the proper value from a command shell or in a startup script such as .kshrc, .cshrc, or .profile. You can determine the proper value for P4CHARSET by examining the current setting of the LANG or LOCALE environment variable.

Sample $LANG value Set P4CHARSET to
en_US.UTF-8 utf8
ja_JP.EUC eucjp
ja_JP.PCK shiftjis
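
The table above can be expressed as a small lookup, shown here as a hedged Python sketch. The function name is hypothetical. The exact matches come from the table; treating any other locale whose codeset is UTF-8 as utf8 is an assumption added for illustration, and every other value returns None.

```python
from typing import Optional

# Exact values taken from the table in this article.
LANG_TO_P4CHARSET = {
    "en_US.UTF-8": "utf8",
    "ja_JP.EUC": "eucjp",
    "ja_JP.PCK": "shiftjis",
}


def p4charset_for_lang(lang: str) -> Optional[str]:
    """Suggest a P4CHARSET value for a $LANG value, or None if unknown."""
    if lang in LANG_TO_P4CHARSET:
        return LANG_TO_P4CHARSET[lang]
    # Assumption: any locale with a UTF-8 codeset maps to utf8.
    # The codeset sits between "." and an optional "@modifier".
    codeset = lang.partition(".")[2].split("@", 1)[0]
    if codeset.upper() in ("UTF-8", "UTF8"):
        return "utf8"
    return None


print(p4charset_for_lang("ja_JP.PCK"))
```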

Setting P4CHARSET on MAC:

Set P4CHARSET to the proper value in either a command shell, for example:

$ export P4CHARSET=utf8

or in the "environment.plist" file, which resides in the ~/.MacOSX directory.

Note that the first form will be valid for your running shell session only. To make the change permanent, P4CHARSET should be set in your shell's startup scripts or in the environment.plist file. The default shell is bash and the startup script is ~/.bashrc.

If P4CHARSET is not set in an environment, P4V users are prompted to select a setting from the drop down list when establishing their first connection with the Unicode enabled server.

 

Determining if the server is unicode enabled

If you try to run most commands against a unicode enabled server without P4CHARSET set, the server returns an error:
$ p4 counters
Unicode server permits only unicode enabled clients.
If unicode is enabled, the output of p4 counters will include a 'unicode' counter with a value of '1'.

Example:
$ p4 counters
change = 1
unicode = 1
upgrade = 21
If you do not have P4CHARSET set, or cannot run p4 counters, you can use tagged output with p4 info. The tagged output generated by p4 -Ztag info has a unicode field that is set to enabled.

Example:
$ p4 -Ztag info
[...]
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
... serverRoot introot/
... serverDate 2010/10/21 11:36:37 -0700 PDT
... serverUptime 02:46:52
... caseHandling sensitive
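
If you want to check this from a script, the tagged output shown above is easy to parse. The sketch below is illustrative only: the function names are hypothetical, it works on text you have already captured from p4 -Ztag info, and it assumes that a server that is not unicode enabled does not report "unicode enabled".

```python
from typing import Dict


def parse_tagged(output: str) -> Dict[str, str]:
    """Parse '... key value' lines of tagged output into a dict."""
    fields = {}
    for line in output.splitlines():
        if line.startswith("... "):
            key, _, value = line[4:].partition(" ")
            fields[key] = value
    return fields


def is_unicode_enabled(output: str) -> bool:
    """True if the tagged info output reports 'unicode enabled'."""
    return parse_tagged(output).get("unicode") == "enabled"


sample = """\
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
... caseHandling sensitive
"""
print(is_unicode_enabled(sample))
```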

 

Possible problems encountered running in unicode mode

"Cannot translate" error message

This message is displayed if your client machine is configured with a character set that does not include characters being sent to it by the Perforce Server. Your client machine cannot display unmapped characters.

For example, if your client machine is configured to use the shift-JIS character set and your depot contains files named using characters from the Japanese EUC character set that do not have mappings in shift-JIS, you see the "Cannot translate..." error message when you execute a p4 files or p4 changes command that lists those files.
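
The underlying issue, a character with no mapping in the target character set, can be reproduced outside Perforce. The following Python sketch uses Python's own codecs, which are an assumption here and are not guaranteed to match Perforce's translation tables exactly; the function name is hypothetical.

```python
from typing import List


def untranslatable_chars(text: str, codec: str) -> List[str]:
    """Return the distinct characters of text that codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# A Japanese file name is representable in Shift-JIS ...
print(untranslatable_chars("日本語.txt", "shift_jis"))
# ... but Korean Hangul characters have no Shift-JIS mapping.
print(untranslatable_chars("한글.txt", "shift_jis"))
```

A non-empty result for a depot path suggests that a client using the corresponding P4CHARSET is likely to hit the "Cannot translate" error when that path is listed.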

Length limit for Unicode Perforce identifiers

The Perforce Server has internal limits on the lengths of strings used to index job descriptions, specify filenames, control view mappings, and identify client names, label names, and other objects.

The most common limit is 1024 bytes. Because some characters in Unicode can expand to more than one byte, it is possible for certain Unicode entries to exceed Perforce internal limits.

Because no basic Unicode character expands to more than three bytes, dividing the Perforce internal limit by three ensures that no Unicode sequence exceeds the limit.

To ensure that no Unicode sequence exceeds the Perforce limit, do not create client names or view patterns that exceed 341 Unicode characters.

Under normal usage conditions, this length limit is not expected to pose a significant limitation.
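
The arithmetic behind the 341-character guideline can be checked with a short Python sketch. The names are hypothetical, and the 1024-byte figure is the "most common limit" quoted above, not a value read from the server; whether the limit itself is inclusive is also an assumption. Note that characters outside the Basic Multilingual Plane take four bytes in UTF-8, so the three-byte bound applies only to basic characters.

```python
LIMIT_BYTES = 1024  # most common internal limit, per this article


def utf8_length(identifier: str) -> int:
    """Number of bytes the identifier occupies when stored as UTF-8."""
    return len(identifier.encode("utf-8"))


def fits_limit(identifier: str, limit: int = LIMIT_BYTES) -> bool:
    """True if the UTF-8 form of the identifier is within the limit."""
    return utf8_length(identifier) <= limit


# Worst case for basic characters is three bytes each.
print(LIMIT_BYTES // 3)
print(utf8_length("あ" * 341), fits_limit("あ" * 341))
print(utf8_length("あ" * 342), fits_limit("あ" * 342))
```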

Possible problems encountered using unicode filetype with a non-unicode server

With a server that is not running in internationalized mode, the Perforce "unicode" filetype behaves quite differently.
The client and server both assume that a file is valid UTF8 and store it as such. The server does not attempt to translate or verify the content of the file in any way. It is imperative that the files be saved using an editor that can save as UTF8 prior to submitting such files to Perforce. Outside of this requirement, users can access the Perforce server normally. There is no need to set P4CHARSET on the client.

Newlines are not correctly saved

This happens when a user checks in a file encoded as UTF16 instead of UTF8. Roll back to an earlier revision or resave the file as UTF8.
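
To catch such files before they are submitted, a simple heuristic check can be sketched in Python. This is an assumption-laden illustration rather than anything Perforce provides: the function name is hypothetical, and the test (a UTF-16 byte order mark, or NUL bytes near the start of the file) can misfire on other binary or UTF-32 data.

```python
import codecs


def looks_like_utf16(data: bytes) -> bool:
    """Heuristically detect UTF-16 content in the first bytes of a file."""
    # A byte order mark is the clearest signal.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    # Without a BOM, UTF-16 text made of ASCII characters still contains
    # NUL bytes, which never appear in ordinary UTF-8 text.
    return b"\x00" in data[:4096]


print(looks_like_utf16("line one\nline two\n".encode("utf-16")))
print(looks_like_utf16("line one\nline two\n".encode("utf-8")))
```

In practice you would read the first few kilobytes of the file in binary mode and pass them to the function.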
