=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read
L<the Perl Unicode tutorial|perlunitut> before reading this reference
document.

=over 4

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":utf8" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item Regular Expressions

The regular expression compiler produces polymorphic opcodes. That is,
the pattern adapts to the data and automatically switches to the Unicode
character scheme when presented with data that is internally encoded in
UTF-8 -- or instead uses a traditional byte scheme when presented with
byte data.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.


=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back

=head2 Byte and Character Semantics

Beginning with version 5.6, Perl uses logically-wide characters to
represent strings internally.

In future, Perl-level operations will be expected to work with
characters rather than bytes.

However, as an interim compatibility measure, Perl aims to
provide a safe migration path from byte semantics to character
semantics for programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to
character semantics. For operations where this determination cannot
be made without additional information from the user, Perl decides in
favor of compatibility and chooses to use byte semantics.

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.
Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<bytes> pragma will always, regardless of platform, force byte
semantics in a particular lexical scope. See L<bytes>.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.

Unless explicitly stated, Perl operators use character semantics
for Unicode data and byte semantics for non-Unicode data.
The decision to use character semantics is made transparently. If
input data comes from a Unicode source--for example, if a character
encoding layer is added to a filehandle or a literal Unicode
string constant appears in a program--character semantics apply.
Otherwise, byte semantics are in effect. The C<bytes> pragma should
be used to force byte semantics on Unicode data.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will be created by
decoding the byte strings as I<ISO 8859-1 (Latin-1)>, even if the
old Unicode string used EBCDIC. This translation is done without
regard to the system's native 8-bit encoding.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.

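The difference between the two semantics can be seen in a small
sketch. It is only an illustration: the variable names are made up
for the example, and the byte count assumes an ASCII-based platform
where Perl's internal encoding is UTF-8.

```perl
use strict;
use warnings;

# A string holding one character above 255 (WHITE SMILING FACE).
my $smiley = "\x{263A}";

# Character semantics: length() counts characters.
my $chars = length($smiley);

# Byte semantics, forced for one lexical scope by the bytes pragma:
# length() now counts the bytes of Perl's internal encoding
# (assumed here to be UTF-8, as on ASCII-based platforms).
my $bytes = do { use bytes; length($smiley) };

# Concatenating a byte string with a character string upgrades the
# byte string as if it were Latin-1, so chr(0xE9) stays U+00E9.
my $mixed = chr(0xE9) . $smiley;

printf "chars=%d bytes=%d mixed=%d first=U+%04X\n",
    $chars, $bytes, length($mixed), ord($mixed);
```

The C<do> block keeps C<use bytes> confined to a single expression,
which is usually preferable to switching a whole file to byte
semantics.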

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\x{...}>
notation. The Unicode code point for the desired character, in hexadecimal,
should be placed in the braces. For instance, a smiley face is
C<\x{263A}>. This encoding scheme works for all characters, but
for characters under 0x100, note that Perl may use an 8 bit encoding
internally, for optimization and/or backward compatibility.

Additionally, if you

    use charnames ':full';

you can use the C<\N{...}> notation and put the official Unicode
character name within the braces, such as C<\N{WHITE SMILING FACE}>.

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.

=item *

Character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".


See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.

See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches any extended Unicode
sequence--"a combining character sequence" in Standardese--where the
first character is a base character and subsequent characters are mark
characters that apply to the base character. C<\X> is equivalent to
C<(?:\PM\pM*)>.

=item *

The C<tr///> operator translates characters instead of bytes. Note
that the C<tr///CU> functionality has been removed. For similar
functionality see pack('U0', ...) and pack('C0', ...).

=item *

Case translation operators use the Unicode case translation tables
when character input is provided. Note that C<uc()>, or C<\U> in
interpolated strings, translates to uppercase, while C<ucfirst>,
or C<\u> in interpolated strings, translates to titlecase in languages
that make the distinction.

=item *

Most operators that deal with positions or lengths in a string will
automatically switch to using character positions, including
C<chop()>, C<chomp()>, C<substr()>, C<pos()>, C<index()>, C<rindex()>,
C<sprintf()>, C<write()>, and C<length()>. An operator that
specifically does not switch is C<vec()>. Operators that really don't
care include operators that treat strings as a bucket of bits such as
C<sort()>, and operators dealing with filenames.

=item *

The C<pack()>/C<unpack()> letter C<C> does I<not> change, since it is often
used for byte-oriented formats. Again, think C<char> in the C language.

There is a new C<U> specifier that converts between Unicode characters
and code points.
There is also a C<W> specifier that is the equivalent of
C<chr>/C<ord> and properly handles character values even if they are above 255.

=item *

The C<chr()> and C<ord()> functions work on characters, similar to
C<pack("W")> and C<unpack("W")>, I<not> C<pack("C")> and
C<unpack("C")>. C<pack("C")> and C<unpack("C")> are methods for
emulating byte-oriented C<chr()> and C<ord()> on Unicode strings.
While these methods reveal the internal encoding of Unicode strings,
that is not something one normally needs to care about at all.

=item *

The bit string operators, C<& | ^ ~>, can operate on character data.
However, for backward compatibility, such as when using bit string
operations when characters are all less than 256 in ordinal value, one
should not use C<~> (the bit complement) with characters of both
values less than 256 and values greater than 256. Most importantly,
DeMorgan's laws (C<~($x|$y) eq ~$x&~$y> and C<~($x&$y) eq ~$x|~$y>)
will not hold. The reason for this mathematical I<faux pas> is that
the complement cannot return B<both> the 8-bit (byte-wide) bit
complement B<and> the full character-wide bit complement.

=item *

lc(), uc(), lcfirst(), and ucfirst() work for the following cases:

=over 8

=item *

the case mapping is from a single Unicode character to another
single Unicode character, or

=item *

the case mapping is from a single Unicode character to more
than one Unicode character.

=back

Things to do with locales (Lithuanian, Turkish, Azeri) do B<not> work
since Perl does not understand the concept of Unicode locales.

See the Unicode Technical Report #21, Case Mappings, for more details.

But you can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).


See L</"User-Defined Case Mappings"> for more details.

=back

=over 4

=item *

And finally, C<scalar reverse()> reverses by character rather than by byte.

=back

=head2 Unicode Character Properties

Named Unicode properties, scripts, and block ranges may be used like
character classes via the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".

For instance, C<\p{Lu}> matches any character with the Unicode "Lu"
(Letter, uppercase) property, while C<\p{M}> matches any character
with an "M" (mark--accents and such) property. Braces are not
required for single letter properties, so C<\p{M}> is equivalent to
C<\pM>. Many predefined properties are available, such as
C<\p{Mirrored}> and C<\p{Tibetan}>.

The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". C<Latin-1 Supplement> thus
becomes C<Latin1Supplement>.

You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.

B<NOTE: the properties, scripts, and blocks listed here are as of
Unicode 5.0.0 in July 2006.>

=over 4

=item General Category

Here are the basic Unicode General Category properties, followed by their
long form. You can use either; C<\p{Lu}> and C<\p{UppercaseLetter}>,
for instance, are identical.


    Short       Long

    L           Letter
    LC          CasedLetter
    Lu          UppercaseLetter
    Ll          LowercaseLetter
    Lt          TitlecaseLetter
    Lm          ModifierLetter
    Lo          OtherLetter

    M           Mark
    Mn          NonspacingMark
    Mc          SpacingMark
    Me          EnclosingMark

    N           Number
    Nd          DecimalNumber
    Nl          LetterNumber
    No          OtherNumber

    P           Punctuation
    Pc          ConnectorPunctuation
    Pd          DashPunctuation
    Ps          OpenPunctuation
    Pe          ClosePunctuation
    Pi          InitialPunctuation
                (may behave like Ps or Pe depending on usage)
    Pf          FinalPunctuation
                (may behave like Ps or Pe depending on usage)
    Po          OtherPunctuation

    S           Symbol
    Sm          MathSymbol
    Sc          CurrencySymbol
    Sk          ModifierSymbol
    So          OtherSymbol

    Z           Separator
    Zs          SpaceSeparator
    Zl          LineSeparator
    Zp          ParagraphSeparator

    C           Other
    Cc          Control
    Cf          Format
    Cs          Surrogate   (not usable)
    Co          PrivateUse
    Cn          Unassigned

Single-letter properties match all characters in any of the
two-letter sub-properties starting with the same letter.
C<LC> and C<L&> are special cases, which are aliases for the set of
C<Ll>, C<Lu>, and C<Lt>.

Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement
the somewhat messy concept of surrogates. C<Cs> is therefore not
supported.

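As a rough illustration of the General Category properties, the
following sketch classifies a few characters with the short form, the
long form, and the brace-less single-letter form. The helper
C<category_of> and the sample characters are hypothetical, chosen only
for this example.

```perl
use strict;
use warnings;

# Three sample characters: LATIN CAPITAL LETTER A, GREEK SMALL LETTER
# ALPHA, and COMBINING ACUTE ACCENT (a nonspacing mark).
my @samples = ("A", "\x{3B1}", "\x{301}");

# Hypothetical helper: report which of three categories a single
# character belongs to, using \p{...} like a character class.
sub category_of {
    my ($ch) = @_;
    return 'Lu' if $ch =~ /^\p{Lu}$/;                # short form
    return 'Ll' if $ch =~ /^\p{LowercaseLetter}$/;   # long form
    return 'M'  if $ch =~ /^\pM$/;                   # single letter, no braces
    return 'other';
}

my @cats = map { category_of($_) } @samples;
print join(' ', @cats), "\n";
```

Because C<\pM> is a single-letter property, it also covers the
two-letter sub-properties C<Mn>, C<Mc>, and C<Me>, which is why the
combining accent is reported as C<M>.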

=item Bidirectional Character Types

Because scripts differ in their directionality--Hebrew is
written right to left, for example--Unicode supplies these properties in
the BidiClass class:

    Property    Meaning

    L           Left-to-Right
    LRE         Left-to-Right Embedding
    LRO         Left-to-Right Override
    R           Right-to-Left
    AL          Right-to-Left Arabic
    RLE         Right-to-Left Embedding
    RLO         Right-to-Left Override
    PDF         Pop Directional Format
    EN          European Number
    ES          European Number Separator
    ET          European Number Terminator
    AN          Arabic Number
    CS          Common Number Separator
    NSM         Non-Spacing Mark
    BN          Boundary Neutral
    B           Paragraph Separator
    S           Segment Separator
    WS          Whitespace
    ON          Other Neutrals

For example, C<\p{BidiClass:R}> matches characters that are normally
written right to left.

=item Scripts

The script names which can be used by C<\p{...}> and C<\P{...}>,
such as in C<\p{Latin}> or C<\p{Cyrillic}>, are as follows:

    Arabic
    Armenian
    Balinese
    Bengali
    Bopomofo
    Braille
    Buginese
    Buhid
    CanadianAboriginal
    Cherokee
    Coptic
    Cuneiform
    Cypriot
    Cyrillic
    Deseret
    Devanagari
    Ethiopic
    Georgian
    Glagolitic
    Gothic
    Greek
    Gujarati
    Gurmukhi
    Han
    Hangul
    Hanunoo
    Hebrew
    Hiragana
    Inherited
    Kannada
    Katakana
    Kharoshthi
    Khmer
    Lao
    Latin
    Limbu
    LinearB
    Malayalam
    Mongolian
    Myanmar
    NewTaiLue
    Nko
    Ogham
    OldItalic
    OldPersian
    Oriya
    Osmanya
    PhagsPa
    Phoenician
    Runic
    Shavian
    Sinhala
    SylotiNagri
    Syriac
    Tagalog
    Tagbanwa
    TaiLe
    Tamil
    Telugu
    Thaana
    Thai
    Tibetan
    Tifinagh
    Ugaritic
    Yi

=item Extended property classes

Extended property classes can supplement the basic
properties, defined by the F<PropList> Unicode database:

    ASCIIHexDigit
    BidiControl
    Dash
    Deprecated
    Diacritic
    Extender
    HexDigit
    Hyphen
    Ideographic
    IDSBinaryOperator
    IDSTrinaryOperator
    JoinControl
    LogicalOrderException
    NoncharacterCodePoint
    OtherAlphabetic
    OtherDefaultIgnorableCodePoint
    OtherGraphemeExtend
    OtherIDStart
    OtherIDContinue
    OtherLowercase
    OtherMath
    OtherUppercase
    PatternSyntax
    PatternWhiteSpace
    QuotationMark
    Radical
    SoftDotted
    STerm
    TerminalPunctuation
    UnifiedIdeograph
    VariationSelector
    WhiteSpace

and there are further derived properties:

    Alphabetic  =  Lu + Ll + Lt + Lm + Lo + Nl + OtherAlphabetic
    Lowercase   =  Ll + OtherLowercase
    Uppercase   =  Lu + OtherUppercase
    Math        =  Sm + OtherMath

    IDStart     =  Lu + Ll + Lt + Lm + Lo + Nl + OtherIDStart
    IDContinue  =  IDStart + Mn + Mc + Nd + Pc + OtherIDContinue

    DefaultIgnorableCodePoint
                =  OtherDefaultIgnorableCodePoint
                   + Cf + Cc + Cs + Noncharacters + VariationSelector
                   - WhiteSpace - FFF9..FFFB (Annotation Characters)

    Any         =  Any code points (i.e. U+0000 to U+10FFFF)
    Assigned    =  Any non-Cn code points (i.e. synonym for \P{Cn})
    Unassigned  =  Synonym for \p{Cn}
    ASCII       =  ASCII (i.e. U+0000 to U+007F)

    Common      =  Any character (or unassigned code point)
                   not explicitly assigned to a script

=item Use of "Is" Prefix

For backward compatibility (with Perl 5.6), all properties mentioned
so far may have C<Is> prepended to their name, so C<\P{IsLu}>, for
example, is equal to C<\P{Lu}>.

=item Blocks

In addition to B<scripts>, Unicode also defines B<blocks> of
characters. The difference between scripts and blocks is that the
concept of scripts is closer to natural languages, while the concept
of blocks is more of an artificial grouping based on groups of 256
Unicode characters.
For example, the C<Latin> script contains letters
from many blocks but does not contain all the characters from those
blocks. It does not, for example, contain digits, because digits are
shared across many scripts. Digits and similar groups, like
punctuation, are in a category called C<Common>.

For more about scripts, see the UAX#24 "Script Names":

    http://www.unicode.org/reports/tr24/

For more about blocks, see:

    http://www.unicode.org/Public/UNIDATA/Blocks.txt

Block names are given with the C<In> prefix. For example, the
Katakana block is referenced via C<\p{InKatakana}>. The C<In>
prefix may be omitted if there is no naming conflict with a script
or any other property, but it is recommended that C<In> always be used
for block tests to avoid confusion.

These block names are supported:

    InAegeanNumbers
    InAlphabeticPresentationForms
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InArabic
    InArabicPresentationFormsA
    InArabicPresentationFormsB
    InArabicSupplement
    InArmenian
    InArrows
    InBalinese
    InBasicLatin
    InBengali
    InBlockElements
    InBopomofo
    InBopomofoExtended
    InBoxDrawing
    InBraillePatterns
    InBuginese
    InBuhid
    InByzantineMusicalSymbols
    InCJKCompatibility
    InCJKCompatibilityForms
    InCJKCompatibilityIdeographs
    InCJKCompatibilityIdeographsSupplement
    InCJKRadicalsSupplement
    InCJKStrokes
    InCJKSymbolsAndPunctuation
    InCJKUnifiedIdeographs
    InCJKUnifiedIdeographsExtensionA
    InCJKUnifiedIdeographsExtensionB
    InCherokee
    InCombiningDiacriticalMarks
    InCombiningDiacriticalMarksSupplement
    InCombiningDiacriticalMarksforSymbols
    InCombiningHalfMarks
    InControlPictures
    InCoptic
    InCountingRodNumerals
    InCuneiform
    InCuneiformNumbersAndPunctuation
    InCurrencySymbols
    InCypriotSyllabary
    InCyrillic
    InCyrillicSupplement
    InDeseret
    InDevanagari
    InDingbats
    InEnclosedAlphanumerics
    InEnclosedCJKLettersAndMonths
    InEthiopic
    InEthiopicExtended
    InEthiopicSupplement
    InGeneralPunctuation
    InGeometricShapes
    InGeorgian
    InGeorgianSupplement
    InGlagolitic
    InGothic
    InGreekExtended
    InGreekAndCoptic
    InGujarati
    InGurmukhi
    InHalfwidthAndFullwidthForms
    InHangulCompatibilityJamo
    InHangulJamo
    InHangulSyllables
    InHanunoo
    InHebrew
    InHighPrivateUseSurrogates
    InHighSurrogates
    InHiragana
    InIPAExtensions
    InIdeographicDescriptionCharacters
    InKanbun
    InKangxiRadicals
    InKannada
    InKatakana
    InKatakanaPhoneticExtensions
    InKharoshthi
    InKhmer
    InKhmerSymbols
    InLao
    InLatin1Supplement
    InLatinExtendedA
    InLatinExtendedAdditional
    InLatinExtendedB
    InLatinExtendedC
    InLatinExtendedD
    InLetterlikeSymbols
    InLimbu
    InLinearBIdeograms
    InLinearBSyllabary
    InLowSurrogates
    InMalayalam
    InMathematicalAlphanumericSymbols
    InMathematicalOperators
    InMiscellaneousMathematicalSymbolsA
    InMiscellaneousMathematicalSymbolsB
    InMiscellaneousSymbols
    InMiscellaneousSymbolsAndArrows
    InMiscellaneousTechnical
    InModifierToneLetters
    InMongolian
    InMusicalSymbols
    InMyanmar
    InNKo
    InNewTaiLue
    InNumberForms
    InOgham
    InOldItalic
    InOldPersian
    InOpticalCharacterRecognition
    InOriya
    InOsmanya
    InPhagspa
    InPhoenician
    InPhoneticExtensions
    InPhoneticExtensionsSupplement
    InPrivateUseArea
    InRunic
    InShavian
    InSinhala
    InSmallFormVariants
    InSpacingModifierLetters
    InSpecials
    InSuperscriptsAndSubscripts
    InSupplementalArrowsA
    InSupplementalArrowsB
    InSupplementalMathematicalOperators
    InSupplementalPunctuation
    InSupplementaryPrivateUseAreaA
    InSupplementaryPrivateUseAreaB
    InSylotiNagri
    InSyriac
    InTagalog
    InTagbanwa
    InTags
    InTaiLe
    InTaiXuanJingSymbols
    InTamil
    InTelugu
    InThaana
    InThai
    InTibetan
    InTifinagh
    InUgaritic
    InUnifiedCanadianAboriginalSyllabics
    InVariationSelectors
    InVariationSelectorsSupplement
    InVerticalForms
    InYiRadicals
    InYiSyllables
    InYijingHexagramSymbols

=back

=head2 User-Defined Character Properties

You can define your own character properties by defining subroutines
whose names begin with "In" or "Is". The subroutines can be defined in
any package. The user-defined properties can be used in the regular
expression C<\p> and C<\P> constructs; if you are using a user-defined
property from a package other than the one you are in, you must specify
its package in the C<\p> or C<\P> construct.

    # assuming property IsForeign defined in Lang::
    package main;  # property package name required
    if ($txt =~ /\p{Lang::IsForeign}+/) { ... }

    package Lang;  # property package name not required
    if ($txt =~ /\p{IsForeign}+/) { ... }

Note that the effect is compile-time and immutable once defined.

The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:

=over 4

=item *

A single hexadecimal number denoting a Unicode code point to include.

=item *

Two hexadecimal numbers separated by horizontal whitespace (space or
tab characters) denoting a range of Unicode code points to include.

=item *

Something to include, prefixed by "+": a built-in character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.


=item *

Something to exclude, prefixed by "-": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to negate, prefixed by "!": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to represent all the characters in that property; two hexadecimal code
points for a range; or a single hexadecimal code point.

=item *

Something to intersect with, prefixed by "&": an existing character
property (prefixed by "utf8::") or a user-defined character property,
to keep only the characters that are also in that property; two
hexadecimal code points for a range; or a single hexadecimal code point.

=back

For example, to define a property that covers both the Japanese
syllabaries (hiragana and katakana), you can define

    sub InKana {
        return <<END;
    3040\t309F
    30A0\t30FF
    END
    }

Imagine that the here-doc end marker is at the beginning of the line.
Now you can use C<\p{InKana}> and C<\P{InKana}>.

You could also have used the existing block property names:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    END
    }

Suppose you wanted to match only the allocated characters,
not the raw block ranges: in other words, you want to remove
the unassigned code points:

    sub InKana {
        return <<'END';
    +utf8::InHiragana
    +utf8::InKatakana
    -utf8::IsCn
    END
    }

The negation is useful for defining (surprise!) negated classes.

    sub InNotKana {
        return <<'END';
    !utf8::InHiragana
    -utf8::InKatakana
    +utf8::IsCn
    END
    }

Intersection is useful for getting the common characters matched by
two (or more) classes.


    sub InFooAndBar {
        return <<'END';
    +main::Foo
    &main::Bar
    END
    }

It's important to remember not to use "&" for the first set -- that
would be intersecting with nothing (resulting in an empty set).

=head2 User-Defined Case Mappings

You can also define your own mappings to be used in the lc(),
lcfirst(), uc(), and ucfirst() (or their string-inlined versions).
The principle is similar to that of user-defined character
properties: you define subroutines in the C<main> package
with names like C<ToLower> (for lc() and lcfirst()), C<ToTitle> (for
the first character in ucfirst()), and C<ToUpper> (for uc(), and the
rest of the characters in ucfirst()).

The string returned by the subroutines must now consist of three
hexadecimal numbers separated by tabs: start of the source
range, end of the source range, and start of the destination range.
For example:

    sub ToUpper {
        return <<END;
    0061\t0063\t0041
    END
    }

defines a uc() mapping that causes only the characters "a", "b", and
"c" to be mapped to "A", "B", "C"; all other characters will remain
unchanged.

If there is no source range to speak of, that is, the mapping is from
a single character to another single character, leave the end of the
source range empty, but the two tab characters are still needed.
For example:

    sub ToLower {
        return <<END;
    0041\t\t0061
    END
    }

defines a lc() mapping that causes only "A" to be mapped to "a"; all
other characters will remain unchanged.

(For serious hackers only) If you want to introspect the default
mappings, you can find the data in the directory
C<$Config{privlib}>/F<unicore/To/>. The mapping data is returned as
the here-document, and the C<utf8::ToSpecFoo> are special exception
mappings derived from C<$Config{privlib}>/F<unicore/SpecialCasing.txt>.
The C<Digit> and C<Fold> mappings that one can see in the directory
are not directly user-accessible; one can use either the
C<Unicode::UCD> module, or just match case-insensitively (that's when
the C<Fold> mapping is used).

A final note on the user-defined case mappings: they will be used
only if the scalar has been marked as having Unicode characters.
Old byte-style strings will not be affected.

=head2 Character Encodings for Input and Output

See L<Encode>.

=head2 Unicode Regular Expression Support Level

The following list of Unicode support for regular expressions describes
all the features currently supported. The references to "Level N"
and the section numbers refer to the Unicode Technical Standard #18,
"Unicode Regular Expressions", version 11, in May 2005.

=over 4

=item *

Level 1 - Basic Unicode Support

    RL1.1   Hex Notation                    - done          [1]
    RL1.2   Properties                      - done          [2][3]
    RL1.2a  Compatibility Properties        - done          [4]
    RL1.3   Subtraction and Intersection    - MISSING       [5]
    RL1.4   Simple Word Boundaries          - done          [6]
    RL1.5   Simple Loose Matches            - done          [7]
    RL1.6   Line Boundaries                 - MISSING       [8]
    RL1.7   Supplementary Code Points       - done          [9]

    [1]  \x{...}
    [2]  \p{...} \P{...}
    [3]  supports not only minimal list (general category, scripts,
         Alphabetic, Lowercase, Uppercase, WhiteSpace,
         NoncharacterCodePoint, DefaultIgnorableCodePoint, Any,
         ASCII, Assigned), but also bidirectional types, blocks, etc.
         (see L</"Unicode Character Properties">)
    [4]  \d \D \s \S \w \W \X [:prop:] [:^prop:]
    [5]  can use regular expression look-ahead [a] or
         user-defined character properties [b] to emulate set operations
    [6]  \b \B
    [7]  note that Perl does Full case-folding in matching, not Simple:
         for example U+1F88 is equivalent with U+1F00 U+03B9,
         not with 1F80.
         This difference matters for certain Greek
         capital letters with certain modifiers: the Full case-folding
         decomposes the letter, while the Simple case-folding would map
         it to a single character.
    [8]  should do ^ and $ also on U+000B (\v in C), FF (\f), CR (\r),
         CRLF (\r\n), NEL (U+0085), LS (U+2028), and PS (U+2029);
         should also affect <>, $., and script line numbers;
         should not split lines within CRLF [c] (i.e. there is no empty
         line between \r and \n)
    [9]  UTF-8/UTF-EBCDIC used in perl allows not only U+10000 to U+10FFFF
         but also beyond U+10FFFF [d]

[a] You can mimic class subtraction using lookahead.
For example, what UTS#18 might write as

    [{Greek}-[{UNASSIGNED}]]

in Perl can be written as:

    (?!\p{Unassigned})\p{InGreekAndCoptic}
    (?=\p{Assigned})\p{InGreekAndCoptic}

But in this particular example, you probably really want

    \p{Greek}

which will match assigned characters known to be part of the Greek script.

Also see the Unicode::Regex::Set module; it implements the full
UTS#18 grouping, intersection, union, and removal (subtraction) syntax.

[b] '+' for union, '-' for removal (set-difference), '&' for intersection
(see L</"User-Defined Character Properties">)

[c] Try the C<:crlf> layer (see L<PerlIO>).

[d] Avoid C<use warnings 'utf8';> (or say C<no warnings 'utf8';>) to allow
U+FFFF (C<\x{FFFF}>).

=item *

Level 2 - Extended Unicode Support

    RL2.1   Canonical Equivalents           - MISSING       [10][11]
    RL2.2   Default Grapheme Clusters       - MISSING       [12][13]
    RL2.3   Default Word Boundaries         - MISSING       [14]
    RL2.4   Default Loose Matches           - MISSING       [15]
    RL2.5   Name Properties                 - MISSING       [16]
    RL2.6   Wildcard Properties             - MISSING

    [10] see UAX#15 "Unicode Normalization Forms"
    [11] have Unicode::Normalize but not integrated to regexes
    [12] have \X but at this level .
             should equal that
        [13] UAX#29 "Text Boundaries" considers CRLF and Hangul syllable
             clusters as a single grapheme cluster.
        [14] see UAX#29, Word Boundaries
        [15] see UAX#21 "Case Mappings"
        [16] have \N{...} but neither compute names of CJK Ideographs
             and Hangul Syllables nor use a loose match [e]

        [e] C<\N{...}> allows namespaces (see L<charnames>).

=item *

Level 3 - Tailored Support

        RL3.1   Tailored Punctuation                - MISSING
        RL3.2   Tailored Grapheme Clusters          - MISSING       [17][18]
        RL3.3   Tailored Word Boundaries            - MISSING
        RL3.4   Tailored Loose Matches              - MISSING
        RL3.5   Tailored Ranges                     - MISSING
        RL3.6   Context Matching                    - MISSING       [19]
        RL3.7   Incremental Matches                 - MISSING
      ( RL3.8   Unicode Set Sharing )
        RL3.9   Possible Match Sets                 - MISSING
        RL3.10  Folded Matching                     - MISSING       [20]
        RL3.11  Submatchers                         - MISSING

        [17] see UTS#10 "Unicode Collation Algorithm"
        [18] have Unicode::Collate but not integrated to regexes
        [19] have (?<=x) and (?=x), but look-aheads or look-behinds should see
             outside of the target substring
        [20] need insensitive matching for linguistic features other than case;
             for example, hiragana to katakana, wide and narrow, simplified Han
             to traditional Han (see UTR#30 "Character Foldings")

=back

=head2 Unicode Encodings

Unicode characters are assigned to I<code points>, which are abstract
numbers.  To use these numbers, various encodings are needed.

=over 4

=item *

UTF-8

UTF-8 is a variable-length (1 to 6 bytes, current character allocations
require 4 bytes), byte-order independent encoding.  For ASCII (and we
really do mean 7-bit ASCII, not another 8-bit encoding), UTF-8 is
transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF       C2..DF    80..BF
   U+0800..U+0FFF       E0        A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the C<A0..BF> in C<U+0800..U+0FFF>, the C<80..9F> in
C<U+D000..U+D7FF>, the C<90..BF> in C<U+10000..U+3FFFF>, and the
C<80..8F> in C<U+100000..U+10FFFF>.  The "gaps" are caused by legal
UTF-8 avoiding non-shortest encodings: it is technically possible to
UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always
be used.  So that's what Perl does.

Another way to look at it is via bits:

 Code Points                    1st Byte   2nd Byte  3rd Byte  4th Byte

                    0aaaaaaa     0aaaaaaa
            00000bbbbbaaaaaa     110bbbbb  10aaaaaa
            ccccbbbbbbaaaaaa     1110cccc  10bbbbbb  10aaaaaa
  00000dddccccccbbbbbbaaaaaa     11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with C<10>, and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

UTF-16 is a 2 or 4 byte encoding.  The Unicode code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and the code
points C<U+10000..U+10FFFF> in two 16-bit units.
The latter case uses I<surrogates>: the first 16-bit unit is the I<high
surrogate>, and the second is the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF>, and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

If you try to generate surrogates (for example by using chr()), you
will get a warning if warnings are turned on, because those code
points are not valid for a Unicode character.

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required, either the UTF-16BE (big-endian) or the UTF-16LE
(little-endian) encoding must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
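
As a rough illustration of the surrogate arithmetic and the BOM byte
sequences described above, here is a minimal sketch.  The helper names
C<to_surrogates>, C<from_surrogates>, and C<sniff_bom> are hypothetical
and not part of Perl or any module, and the BOM check deliberately
ignores the UTF-32 family (whose little-endian BOM also begins with
C<0xFF 0xFE>).

```perl
use strict;
use warnings;

# Hypothetical helper: split a supplementary code point
# (U+10000..U+10FFFF) into a (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    die "not a supplementary code point\n"
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = 0xD800 + ($offset >> 10);      # upper 10 bits of the offset
    my $lo = 0xDC00 + ($offset & 0x3FF);    # lower 10 bits of the offset
    return ($hi, $lo);
}

# Hypothetical helper: recombine a surrogate pair into one code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
}

# Hypothetical helper: guess an encoding from the leading bytes of a
# byte string.  Returns undef when no BOM is recognized.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

my ($hi, $lo) = to_surrogates(0x1D11E);     # MUSICAL SYMBOL G CLEF
printf "U+1D11E -> %04X %04X\n", $hi, $lo;
```

For real input and output, the C<:encoding(UTF-16)> layer or the
L<Encode> module is the better choice; the sketch only makes the
arithmetic concrete.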

The way this trick works is that the character with the code point
C<U+FFFE> is guaranteed not to be a valid Unicode character, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  The BOM signatures will be C<0x00 0x00 0xFE 0xFF> for BE and
C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32.

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back

=head2 Security Implications of Unicode

=over 4

=item *

Malformed UTF-8

Unfortunately, the specification of UTF-8 leaves some room for
interpretation of how many bytes of encoded output one should generate
from one input Unicode character.  Strictly speaking, the shortest
possible sequence of UTF-8 bytes should be generated,
because otherwise there is potential for an input buffer overflow at
the receiving end of a UTF-8 connection.  Perl always generates the
shortest length UTF-8, and with warnings on Perl will warn about
non-shortest length UTF-8 along with other malformations, such as the
surrogates, which are not real Unicode code points.

=item *

Regular expressions behave slightly differently between byte data and
character (Unicode) data.
For example, the "word character" character
class C<\w> will work differently depending on whether the data are
eight-bit bytes or Unicode.

In the first case, the set of C<\w> characters is either small--the
default set of alphabetic characters, digits, and the "_"--or, if you
are using a locale (see L<perllocale>), the C<\w> might contain a few
more letters according to your language and country.

In the second case, the C<\w> set of characters is much, much larger.
Most importantly, even in the set of the first 256 characters, it will
probably match different characters: unlike most locales, which are
specific to a language and country pair, Unicode classifies all the
characters that are letters I<somewhere> as C<\w>.  For example, your
locale might not think that LATIN SMALL LETTER ETH is a letter (unless
you happen to speak Icelandic), but Unicode does.

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently.  Review your
code.  Use warnings and the C<strict> pragma.

=back

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed.
There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode.  See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

Usually locale settings and Unicode do not affect each other, but
there are a couple of exceptions:

=over 4

=item *

You can enable automatic UTF-8-ification of your standard file
handles, default C<open()> layer, and C<@ARGV> by using either
the C<-C> command line switch or the C<PERL_UNICODE> environment
variable; see L<perlrun> for the documentation of the C<-C> switch.

=item *

Perl tries really hard to work both with Unicode and the old
byte-oriented world.  Most often this is nice, but sometimes Perl's
straddling of the proverbial fence causes problems.

=back

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other 'entry points' like C<@ARGV> that can be interpreted
as Unicode (UTF-8), there are still many places where Unicode (in some
encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  For all of these interfaces Perl
currently (as of 5.8.3) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the C<encoding> pragma has been used.

One reason why Perl does not attempt to resolve the role of Unicode in
these cases is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not exactly a
portable concept.  Similarly for C<qx> and C<system>: how well will the
'command line interface' (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back

=head2 Forcing Unicode in Perl (Or Unforcing Unicode in Perl)

Sometimes (see L</"When Unicode Does Not Happen">) there are
situations where you simply need to force Perl to believe that a byte
string is UTF-8, or vice versa.  The low-level calls
utf8::upgrade($bytestring) and utf8::downgrade($utf8string) are
the answers.

Do not use them without careful thought, though: Perl may easily get
very confused, angry, or even crash, if you suddenly change the 'nature'
of a scalar like that.  Be especially careful if you use
utf8::upgrade(): not every byte string is valid UTF-8.

=head2 Using Unicode in XS

If you want to handle Perl Unicode in XS extensions, you may find the
following C APIs useful.  See also L<perlguts/"Unicode Support"> for an
explanation about Unicode at the XS level, and L<perlapi> for the API
details.

=over 4

=item *

C<DO_UTF8(sv)> returns true if the C<UTF8> flag is on and the bytes
pragma is not in effect.  C<SvUTF8(sv)> returns true if the C<UTF8>
flag is on; the bytes pragma is ignored.  The C<UTF8> flag being on
does B<not> mean that there are any characters of code points greater
than 255 (or 127) in the scalar or that there are even any characters
in the scalar.  What the C<UTF8> flag means is that the sequence of
octets in the representation of the scalar is the sequence of UTF-8
encoded code points of the characters of a string.
The C<UTF8> flag
being off means that each octet in this representation encodes a
single character with code point 0..255 within the string.  Perl's
Unicode model is not to use UTF-8 until it is absolutely necessary.

=item *

C<uvuni_to_utf8(buf, chr)> writes a Unicode character code point into
a buffer encoding the code point as UTF-8, and returns a pointer
pointing after the UTF-8 bytes.

=item *

C<utf8_to_uvuni(buf, lenp)> reads UTF-8 encoded bytes from a buffer and
returns the Unicode character code point and, optionally, the length of
the UTF-8 byte sequence.

=item *

C<utf8_length(start, end)> returns the length of the UTF-8 encoded buffer
in characters.  C<sv_len_utf8(sv)> returns the length of the UTF-8 encoded
scalar.

=item *

C<sv_utf8_upgrade(sv)> converts the string of the scalar to its UTF-8
encoded form.  C<sv_utf8_downgrade(sv)> does the opposite, if
possible.  C<sv_utf8_encode(sv)> is like C<sv_utf8_upgrade()> except that
it does not set the C<UTF8> flag.  C<sv_utf8_decode()> does the
opposite of C<sv_utf8_encode()>.  Note that none of these are to be
used as general-purpose encoding or decoding interfaces: C<use Encode>
for that.  C<sv_utf8_upgrade()> is affected by the encoding pragma
but C<sv_utf8_downgrade()> is not (since the encoding pragma is
designed to be a one-way street).

=item *

C<is_utf8_char(s)> returns true if the pointer points to a valid UTF-8
character.

=item *

C<is_utf8_string(buf, len)> returns true if C<len> bytes of the buffer
are valid UTF-8.

=item *

C<UTF8SKIP(buf)> will return the number of bytes in the UTF-8 encoded
character in the buffer.  C<UNISKIP(chr)> will return the number of bytes
required to UTF-8-encode the Unicode character code point.
C<UTF8SKIP()>
is useful, for example, for iterating over the characters of a UTF-8
encoded buffer; C<UNISKIP()> is useful, for example, in computing
the size required for a UTF-8 encoded buffer.

=item *

C<utf8_distance(a, b)> will tell the distance in characters between the
two pointers pointing to the same UTF-8 encoded buffer.

=item *

C<utf8_hop(s, off)> will return a pointer to a UTF-8 encoded buffer
that is C<off> (positive or negative) Unicode characters displaced
from the UTF-8 buffer C<s>.  Be careful not to overstep the buffer:
C<utf8_hop()> will merrily run off the end or the beginning of the
buffer if told to do so.

=item *

C<pv_uni_display(dsv, spv, len, pvlim, flags)> and
C<sv_uni_display(dsv, ssv, pvlim, flags)> are useful for debugging the
output of Unicode strings and scalars.  By default they are useful
only for debugging--they display B<all> characters as hexadecimal code
points--but with the flags C<UNI_DISPLAY_ISPRINT>,
C<UNI_DISPLAY_BACKSLASH>, and C<UNI_DISPLAY_QQ> you can make the
output more readable.

=item *

C<ibcmp_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2)> can be used to
compare two strings case-insensitively in Unicode.  For case-sensitive
comparisons you can just use C<memEQ()> and C<memNE()> as usual.

=back

For more information, see L<perlapi>, and F<utf8.c> and F<utf8.h>
in the Perl source code distribution.

=head1 BUGS

=head2 Interaction with Locales

Use of locales with Unicode data may lead to odd results.  Currently,
Perl attempts to attach 8-bit locale info to characters in the range
0..255, but this technique is demonstrably incorrect for locales that
use characters above that range when mapped into Unicode.  Perl's
Unicode support will also tend to run slower.
Use of locales with
Unicode is discouraged.

=head2 Interaction with Extensions

When Perl exchanges data with an extension, the extension should be
able to understand the UTF8 flag and act accordingly.  If the
extension doesn't know about the flag, it's likely that the extension
will return incorrectly-flagged data.

So if you're working with Unicode data, consult the documentation of
every module you're using to see if there are any issues with Unicode
data exchange.  If the documentation does not talk about Unicode at all,
suspect the worst and probably look at the source to learn how the
module is implemented.  Modules written completely in Perl shouldn't
cause problems.  Modules that directly or indirectly access code written
in other programming languages are at risk.

For affected functions, the simple strategy to avoid data corruption is
to always make the encoding of the exchanged data explicit.  Choose an
encoding that you know the extension can handle.  Convert arguments passed
to the extensions to that encoding and convert results back from that
encoding.  Write wrapper functions that do the conversions for you, so
you can later change the functions when the extension catches up.

To provide an example, let's say the popular Foo::Bar::escape_html
function doesn't deal with Unicode data yet.  The wrapper function
would convert the argument to raw UTF-8 and convert the result back to
Perl's internal representation like so:

    sub my_escape_html ($) {
      my($what) = shift;
      return unless defined $what;
      Encode::decode_utf8(Foo::Bar::escape_html(Encode::encode_utf8($what)));
    }

Sometimes, when the extension does not convert data but just stores
and retrieves them, you will be in a position to use the otherwise
dangerous Encode::_utf8_on() function.
Let's say the popular
C<Foo::Bar> extension, written in C, provides a C<param> method that
lets you store and retrieve data according to these prototypes:

    $self->param($name, $value);            # set a scalar
    $value = $self->param($name);           # retrieve a scalar

If it does not yet provide support for any encoding, one could write a
derived class with such a C<param> method:

    sub param {
      my($self,$name,$value) = @_;
      utf8::upgrade($name);     # make sure it is UTF-8 encoded
      if (defined $value) {
        utf8::upgrade($value);  # make sure it is UTF-8 encoded
        return $self->SUPER::param($name,$value);
      } else {
        my $ret = $self->SUPER::param($name);
        Encode::_utf8_on($ret); # we know, it is UTF-8 encoded
        return $ret;
      }
    }

Some extensions provide filters on data entry/exit points, such as
DB_File::filter_store_key and family.  Look out for such filters in
the documentation of your extensions; they can make the transition to
Unicode data much easier.

=head2 Speed

Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings.  All functions that need to hop over
characters such as length(), substr() or index(), or matching regular
expressions can work B<much> faster when the underlying data are
byte-encoded.

In Perl 5.8.0 the slowness was often quite spectacular; in Perl 5.8.1
a caching scheme was introduced which will hopefully make the slowness
somewhat less spectacular, at least for some operations.  In general,
operations with UTF-8 encoded strings are still slower.  As an example,
the Unicode properties (character classes) like C<\p{Nd}> are known to
be quite a bit slower (5-20 times) than their simpler counterparts
like C<\d> (then again, there are 268 Unicode characters matching C<Nd>
compared with the 10 ASCII characters matching C<\d>).
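
Because the size of the difference depends heavily on the Perl version
and the operation, it may be worth measuring on your own installation.
The following is only a sketch, using the core C<Benchmark> module; the
string contents, the offset, and the choice of C<substr()> are arbitrary
assumptions made for illustration, and no particular result is implied.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# The same 40,000 characters in the two internal representations.
my $byte_str = "abc\xE9" x 10_000;    # stored one byte per character
my $utf8_str = $byte_str;
utf8::upgrade($utf8_str);             # same characters, UTF-8 internally

# Time an operation that has to locate a character offset.
# A negative count asks Benchmark to run each case for at least
# that many CPU seconds; 'none' keeps timethese() from printing
# its own report.
my $timings = timethese(-1, {
    'byte-encoded'  => sub { my $c = substr($byte_str, 30_000, 1) },
    'UTF-8-encoded' => sub { my $c = substr($utf8_str, 30_000, 1) },
}, 'none');

for my $name (sort keys %$timings) {
    printf "%-14s %s\n", $name, Benchmark::timestr($timings->{$name});
}
```

Both strings compare equal and have the same C<length()>; only the
internal storage differs, which is what the timing is meant to expose.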

=head2 Porting code from perl-5.6.X

Perl 5.8 has a different Unicode model from 5.6.  In 5.6 the programmer
was required to use the C<utf8> pragma to declare that a given scope
expected to deal with Unicode data and had to make sure that only
Unicode data were reaching that scope.  If you have code that is
working with 5.6, you will need some of the following adjustments to
your code.  The examples are written such that the code will continue
to work under 5.6, so you should be safe to try them out.

=over 4

=item *

A filehandle that should read or write UTF-8

  if ($] > 5.007) {
    binmode $fh, ":encoding(utf8)";
  }

=item *

A scalar that is going to be passed to some extension

Be it Compress::Zlib, Apache::Request or any extension that has no
mention of Unicode in the manpage, you need to make sure that the
UTF8 flag is stripped off.  Note that at the time of this writing
(October 2002) the mentioned modules are not UTF-8-aware.  Please
check the documentation to verify if this is still true.

  if ($] > 5.007) {
    require Encode;
    $val = Encode::encode_utf8($val); # make octets
  }

=item *

A scalar we got back from an extension

If you believe the scalar comes back as UTF-8, you will most likely
want the UTF8 flag restored:

  if ($] > 5.007) {
    require Encode;
    $val = Encode::decode_utf8($val);
  }

=item *

Same thing, if you are really sure it is UTF-8

  if ($] > 5.007) {
    require Encode;
    Encode::_utf8_on($val);
  }

=item *

A wrapper for fetchrow_array and fetchrow_hashref

When the database contains only UTF-8, a wrapper function or method is
a convenient way to replace all your fetchrow_array and
fetchrow_hashref calls.
A wrapper function will also make it easier to
adapt to future enhancements in your database driver.  Note that at the
time of this writing (October 2002), the DBI has no standardized way
to deal with UTF-8 data.  Please check the documentation to verify if
that is still true.

  sub fetchrow {
    my($self, $sth, $what) = @_; # $what is one of fetchrow_{array,hashref}
    if ($] < 5.007) {
      return $sth->$what;
    } else {
      require Encode;
      if (wantarray) {
        my @arr = $sth->$what;
        for (@arr) {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_);
        }
        return @arr;
      } else {
        my $ret = $sth->$what;
        if (ref $ret) {
          for my $k (keys %$ret) {
            defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret->{$k};
          }
          return $ret;
        } else {
          defined && /[^\000-\177]/ && Encode::_utf8_on($_) for $ret;
          return $ret;
        }
      }
    }
  }

=item *

A large scalar that you know can only contain ASCII

Scalars that contain only ASCII and are marked as UTF-8 are sometimes
a drag to your program.  If you recognize such a situation, just remove
the UTF8 flag:

  utf8::downgrade($val) if $] > 5.007;

=back

=head1 SEE ALSO

L<perlunitut>, L<perluniintro>, L<Encode>, L<open>, L<utf8>, L<bytes>,
L<perlretut>, L<perlvar/"${^UNICODE}">

=cut
Generated: Tue Mar 17 22:47:18 2015 | Cross-referenced by PHPXref 0.7.1 |