In Perforce there are several ways to work with multiple character sets depending on your requirements:
- If your filenames or Perforce metadata contain non-ASCII characters, then your Perforce administrator might need to consider switching your Perforce Server into unicode mode as described below. When running in unicode mode, all non-file data (identifiers, descriptions, and so on), as well as the content of all files of type "unicode", are translated between the character set specified by the P4CHARSET variable on the client and UTF8 in the server.
Before switching to unicode mode, verify that the character set you want to work with is supported.
- If the goal is just to manage files that contain unicode characters, you may not need p4d in unicode mode at all: the "utf8" and "utf16" filetypes address the same problem of handling unicode file content.
- If you need to work on unicode files that contain characters saved in the users directory, syncing such files to, or submitting them from, a single client machine can become cumbersome, because extra steps (switching between different P4CHARSET values, installing additional code pages, and so on) are required to complete the task.
- Note that unicode files can always be added as binary files. This makes diffing such files more difficult, because by default Perforce does not diff binary files. However, if your binary files are actually UTF8 or UTF16 files, the default diff/merge tool in P4V diffs them correctly. P4V users can also specify a third-party diff/merge tool for such files, and command-line users can force the diff with the "-t" flag.
Switching the Perforce server into unicode mode
Before you use Perforce in a unicode environment, you must instruct your Perforce Server to run in unicode mode. To do so, stop the server, and then run this command from within your Perforce server root directory:
p4d -xi
This command verifies that all existing metadata is valid UTF8 and sets a protected unicode counter to make sure that future invocations of p4d operate in unicode mode. Once set on the server, unicode mode cannot be deactivated (that is, you cannot return to non-unicode mode). After p4d -xi switches your server into unicode mode, you can invoke p4d with your usual flags.
If you try to switch the server to unicode mode with the p4d -xi command, the server might respond with "invalid UTF8" messages such as:
Table db.user has 14 rows with invalid UTF8.
Table db.domain has 1 rows with invalid UTF8.
Perforce server error:
Database has 14 tables with non-UTF8 text and can't be switched to Unicode mode.
Take special note of the table names with invalid UTF8: if one of the db.rev* or db.working tables is listed, you might have a file name with invalid UTF8 whose archive file or directory will need to be renamed.
To fix this problem, do the following:
- Stop the server to prevent updates during this process.
- Take a checkpoint.
- Convert the checkpoint file to be UTF8 encoded.
Summary: use any editor or process to remove non-UTF8 byte sequences or convert them to valid UTF8 byte sequences.
On Unix, consider using iconv. On Windows, a version of iconv is: http://gnuwin32.sourceforge.net/packages/libiconv.htm
Windows users can also use an editor of their choice (e.g., notepad2) that can save in UTF8 encoding. The editor itself is not important, although word processors should be avoided because they may introduce additional formatting. Most Windows editors have file size limitations. Notepad saves with a BOM, which must be removed.
- Remove all db.* files.
- Restore from the UTF8 checkpoint file.
- Try p4d -xi again.
To convert to legal UTF8, you can use any of the available character set conversion tools. The "iconv" converter is a good choice, and it is available for both Unix and Windows. Note that "iconv" might miss some German umlaut characters, so check its output carefully. If identifying non-UTF8 metadata becomes a bigger issue, ask email@example.com for a tool called "jnltool.pl".
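The conversion step can also be sketched in a few lines of Python. This is a minimal illustration, not a Perforce-supplied tool: the function names are hypothetical, the sample data is made up, and the legacy encoding (cp1252 here) is an assumption that you must replace with whatever encoding your metadata actually uses. It reports which lines are not valid UTF8, re-encodes only those lines, and drops a leading BOM such as the one Notepad writes.

```python
import codecs
from typing import List


def invalid_utf8_lines(data: bytes) -> List[int]:
    """Return the 1-based numbers of the lines that are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(number)
    return bad


def convert_to_utf8(data: bytes, legacy_encoding: str = "cp1252") -> bytes:
    """Re-encode only the invalid lines, assuming they use legacy_encoding."""
    # Drop a leading UTF-8 BOM, such as the one Notepad writes.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    converted = []
    for line in data.split(b"\n"):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: every invalid line was written in legacy_encoding.
            line = line.decode(legacy_encoding).encode("utf-8")
        converted.append(line)
    return b"\n".join(converted)


# Made-up sample data, not a real checkpoint record.
checkpoint_sample = b"user alice\nuser J\xfcrgen\n"
print(invalid_utf8_lines(checkpoint_sample))                   # [2]
print(invalid_utf8_lines(convert_to_utf8(checkpoint_sample)))  # []
```

Listing the invalid lines before converting makes it easier to confirm that the assumed legacy encoding produces the names you expect. Work on a copy of the checkpoint, never on the only one.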
Run p4 verify immediately if you had to convert your checkpoint using any method. If you had db.rev* tables with invalid UTF8 then your p4 verify might show all revisions as MISSING! and the archive file or directory will need to be renamed.
When connecting to a unicode-enabled Helix server, Helix clients detect and set the client's charset automatically. In very rare cases, users of P4V and other Helix client applications might be asked to choose their encoding when first connecting to a unicode-enabled server.
Be aware that mixing different encodings and, consequently, P4CHARSET settings on the same computer is likely to cause file corruption and/or translation problems.
The following table lists a few of the most used (in the USA) P4CHARSET values:
|English/High-ASCII||MAC OS X||n/a||n/a||utf8|
The P4CHARSET value "none" is also worth mentioning: a) it overrides any existing P4CHARSET when used with the "-C" switch, and b) it allows you to connect to a (non-)unicode-enabled server. For the complete list of supported P4CHARSET values, run p4 help charset or visit: http://www.perforce.com/perforce/doc.current/user/i18nnotes.txt
p4 -C utf16 -Q utf8 sync some_files
"-C" is a command line flag for P4CHARSET
"-Q" is for P4COMMANDCHARSET
Note that P4V has a field in the Preferences dialog to reset P4CHARSET.
Determining if the server is unicode enabled
If you try to connect to a unicode-enabled server without P4CHARSET set, most commands return an error:
$ p4 counters
Unicode server permits only unicode enabled clients.
If unicode is enabled, the output of p4 counters will include a 'unicode' counter with a value of '1'.
change = 1
unicode = 1
upgrade = 21
If you do not have P4CHARSET set, or cannot run p4 counters, you can use tagged output with p4 info. The tagged info output, generated by p4 -Ztag info, will have a unicode field set to enabled.
$ p4 -Ztag info
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
... serverRoot introot/
... serverDate 2010/10/21 11:36:37 -0700 PDT
... serverUptime 02:46:52
... caseHandling sensitive
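For scripts, the same check can be done by parsing the tagged output. The following is a minimal sketch, assuming the "... key value" line layout shown above; the function names are hypothetical, and the sample text is copied from the output above rather than taken from a live server.

```python
from typing import Dict


def parse_tagged(output: str) -> Dict[str, str]:
    """Turn lines of the form '... key value' into a dictionary."""
    fields = {}
    for line in output.splitlines():
        if not line.startswith("... "):
            continue  # skip anything that is not a tagged field
        key, _, value = line[4:].partition(" ")
        fields[key] = value
    return fields


def is_unicode_server(output: str) -> bool:
    """True if the tagged info output reports 'unicode enabled'."""
    return parse_tagged(output).get("unicode") == "enabled"


info_sample = """\
... clientAddress 127.0.0.1:50936
... unicode enabled
... serverAddress localhost:9988
"""
print(is_unicode_server(info_sample))  # True
```

In a real script, the text passed to is_unicode_server would be the captured output of p4 -Ztag info.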
Possible problems encountered running in unicode mode
"Cannot translate" error message
This message is displayed if your client machine is configured with a character set that does not include characters being sent to it by the Perforce Server. Your client machine cannot display unmapped characters.
For example, if your client machine is configured to use the shift-JIS character set and your depot contains files named using characters from the Japanese EUC character set that do not have mappings in shift-JIS, you see the "Cannot translate..." error message when you execute a p4 files or p4 changes command that lists those files.
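The underlying cause can be illustrated outside Perforce. The sketch below uses Python's own codecs, which are an assumption for illustration only: Perforce uses its own translation tables, and Python codec names are not P4CHARSET values. The hypothetical helper can_translate reports whether every character of a name has a mapping in the target character set.

```python
def can_translate(text: str, charset: str) -> bool:
    """Return True if every character in text can be encoded in charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


# Japanese text has mappings in Shift-JIS; Korean Hangul does not.
print(can_translate("日本語", "shift_jis"))  # True
print(can_translate("한국어", "shift_jis"))  # False
```

A file name that fails a check like this for the client's character set is the kind of name that triggers the error.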
Length limit for Unicode Perforce identifiers
The Perforce Server has internal limits on the lengths of strings used to index job descriptions, specify filenames, control view mappings, and identify client names, label names, and other objects.
The most common limit is 1024 bytes. Because some characters in Unicode can expand to more than one byte, it is possible for certain Unicode entries to exceed Perforce internal limits.
Because no basic Unicode character expands to more than three bytes, dividing the Perforce internal limit by three ensures that no Unicode sequence exceeds the limit.
To ensure that no Unicode sequence exceeds the Perforce limit, do not create client names or view patterns that exceed 341 Unicode characters.
Under normal usage conditions, this length limit is not expected to pose a significant limitation.
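The arithmetic behind the 341-character guideline can be checked with a short sketch. The helper names are hypothetical, and the 1024-byte figure is the "most common limit" quoted above; other identifiers may have different limits.

```python
LIMIT_BYTES = 1024  # the most common limit quoted above


def utf8_length(identifier: str) -> int:
    """Number of bytes the identifier occupies when encoded as UTF-8."""
    return len(identifier.encode("utf-8"))


def fits_limit(identifier: str, limit: int = LIMIT_BYTES) -> bool:
    """True if the UTF-8 encoding of identifier is within the byte limit."""
    return utf8_length(identifier) <= limit


print(LIMIT_BYTES // 3)        # 341
print(fits_limit("日" * 341))  # True: 341 * 3 = 1023 bytes
print(fits_limit("日" * 342))  # False: 342 * 3 = 1026 bytes
```

Counting encoded bytes rather than characters shows why an identifier of mostly ASCII characters can be much longer than 341 characters and still fit.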
Possible problems encountered using unicode filetype with a non-unicode server
With a server not running in internationalized mode, the Perforce "unicode" filetype behaves quite differently.
The client and server both assume that a file is valid UTF8 and store it as such. The server does not attempt to translate or verify the content of the file in any way. It is imperative that the files be saved using an editor that can save as UTF8 prior to submitting such files to Perforce. Outside of this requirement, users can access the Perforce server normally. There is no need to set P4CHARSET on the client.
Newlines are not correctly saved
The file was checked in as UTF16 instead of UTF8 by a user. Roll back to an old revision or resave the file as UTF8.
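The "resave as UTF8" step can be sketched as follows. This is a minimal illustration with a hypothetical function name. It assumes the file starts with a UTF-16 byte order mark and is not UTF-32, whose little-endian BOM begins with the same two bytes. Files without a UTF-16 BOM are returned unchanged.

```python
import codecs


def resave_as_utf8(data: bytes) -> bytes:
    """Convert BOM-marked UTF-16 content to UTF-8; leave other content alone."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16").encode("utf-8")
    return data


utf16_sample = "line one\nline two\n".encode("utf-16")
print(resave_as_utf8(utf16_sample))  # b'line one\nline two\n'
```

After converting the working copy, submit it as a new revision so that the stored content matches what the "unicode" filetype expects on a non-unicode server.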