There is a great deal of material on Perl's regular expressions out there, but of course the real leader is 'Mastering Regular Expressions' by J. Friedl: worth reading, worth arguing with and, after all, worth learning from. What we are after here is, to some extent, Unicode, and how Perl's regular expressions can help in handling it.
FYI: perldoc perlunicode - Perl Unicode support
To do this homework properly, you need a confident understanding (based mostly on your own experience, of course) of how regular expressions work, especially the parts concerned with building regexp objects and with embedded code, that is, executing a piece of Perl code dynamically inside m// or s/// during matching or substitution. It also helps if you already have some experience with Unicode, so that there is no need to remind you that alongside the 'use utf8' or 'use encoding "utf8"' pragmas it is almost always advisable to add one more, 'use re "eval"', which lets the perl interpreter execute code snippets that reach a regexp through run-time interpolation. Nor should there be any need to recall such Perl Unicode ABCs as the behaviour of '\w' with a home-grown alphanumeric class of characters.
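As a minimal sketch of what 'use re "eval"' changes, assume a pattern that lives in an ordinary string and carries an embedded code block. The names `$pattern`, `$hits`, `$allowed` and `$error` are made up for this illustration; the first attempt is wrapped in `eval` so that the refusal can be observed instead of ending the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

our $hits = 0;    # hypothetical counter, bumped by the embedded code

# A pattern kept in a plain string, with an embedded code block inside.
my $pattern = '(\d)(?{ $main::hits++ })';

# Without 'use re "eval"' perl refuses to run code that arrives
# in a pattern through run-time interpolation, so this attempt dies.
my $allowed = eval { 'a1' =~ /$pattern/; 1 } ? 1 : 0;
my $error   = $@;

{
    use re 'eval';    # the permission is lexically scoped to this block
    'a1' =~ /$pattern/;
}

print "allowed without the pragma: $allowed\n";
print "hits with the pragma: $hits\n";
```

The pragma is only needed for patterns assembled at run time from strings; keeping it inside the smallest possible block limits where interpolated code may run.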
One last note before we start: 'use locale' casts a deadly shadow over the cross-platform design of your applications, especially when something like '\p{InCyrillic}' or '\p{InCJKCompatibilityIdeographs}' is involved, because in regular expressions beasts like these become extremely sensitive to locale settings and to the 'use bytes' and 'no bytes' contexts.
FYI: perldoc utf8 - Perl pragma to enable/disable UTF-8 (or UTF-EBCDIC) in source code
To wrap up the introduction, let's refresh a few basic functions that are a good help in baking and stuffing Unicode-flavoured frameworks:
#!/usr/bin/perl
use strict;
use encoding 'utf8'; # note: the 'encoding' pragma is deprecated in newer Perls, see 'perldoc encoding'
our ( @packed, );
@packed = map{chr($_)} (0x4E00 .. 0x9FFF);
print $_, "\n" for @packed[0 .. 9];
exit(0);
This short and speedy 'CJK character generator' gives birth in no time to more than 20,000 characters (20,992, to be exact) that one can find in Chinese, Japanese, Korean and, probably, old Vietnamese. The line:
@packed = map{chr($_)} (0x4E00 .. 0x9FFF);
may be seamlessly replaced with something like the following:
@packed = map{pack 'U', $_} (0x4E00 .. 0x9FFF);
because here 'pack' and 'chr' are interchangeable, no doubt.
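The interchangeability is easy to check. The sketch below builds the same range both ways and counts the positions where the results differ; the variable names are invented for the example and it assumes no 'use bytes' is in effect.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the same code point range with chr and with pack 'U'.
my @by_chr  = map { chr $_ }       (0x4E00 .. 0x9FFF);
my @by_pack = map { pack 'U', $_ } (0x4E00 .. 0x9FFF);

# Count the positions at which the two lists disagree.
my $count       = scalar @by_chr;
my $differences = grep { $by_chr[$_] ne $by_pack[$_] } 0 .. $#by_chr;

printf "%d characters, %d differences\n", $count, $differences;
printf "first code point: U+%04X\n", ord $by_chr[0];
```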
Now let's do some matching. The job should be done efficiently, without temporary variables that eat a significant amount of memory; in other words, we have to make use of Perl's built-ins. Again, let's take a Chinese string as a sample, given the large number of Unicode code points devoted to this language:
#!/usr/bin/perl
use strict;
use utf8;
use encoding 'utf8';
our ( $re, $sample );
$re = qr{(
[\p{InCJKUnifiedIdeographs}] |
[\p{InCJKCompatibility}] |
[\p{InCJKCompatibilityForms}] |
[\p{InCJKRadicalsSupplement}] |
[\p{InCJKCompatibilityIdeographsSupplement}] |
[\p{InCJKUnifiedIdeographsExtensionA}] |
[\p{InCJKCompatibilityIdeographs}] |
[\p{InCJKUnifiedIdeographsExtensionB}])}ox;
$sample = '五行: 一曰水, 二曰火, 三曰木, 四曰金, 五曰土
пять стихий: первая называется вода,
вторая — огонь, третья — дерево,
четвёртая — металл,
пятая — земля Five Elements: first is Water,
second is Fire, third is Wood, fourth is Metal, fifth is Earth';
while($sample =~ m/$re/g)
{
print $1,"\n";
}
exit(0);
As one can see, the sample consists of three languages, from which Perl has carefully picked, one by one (just as it was told to), the characters of the CJK family. Those who have forgotten what the letters placed right after the '$re' regexp object mean are advised to turn to 'perldoc perlre' in the Perl pod; and for those Perl critters who like to know everything about anything, we have a few questions: (1) where did Perl get the value of the '$1' variable? (2) what is the difference between 'use utf8' and 'use encoding "utf8"'? (and why are both of them here?!) (3) why are the 'ox' and 'g' regexp modifiers separated here? If you have all the answers ready at once, you're the best!
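One way to watch the capture group at work, without touching '$1' by hand, is to run the match in list context: with 'g' it returns every captured substring at once. This is only a sketch with a shortened block list folded into a single character class; `$cjk`, `$text` and `@found` are names chosen for the example, and output encoding is handled with 'binmode' rather than the 'encoding' pragma.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # the source below contains UTF-8 literals

# Several \p{...} blocks can share one bracketed character class
# (only three of the blocks from the script above are listed here).
my $cjk = qr/[\p{InCJKUnifiedIdeographs}\p{InCJKUnifiedIdeographsExtensionA}\p{InCJKCompatibilityIdeographs}]/;

my $text = '五行: 一曰水, пять стихий, Five Elements';

# List context plus /g: every capture comes back as a list element.
my @found = $text =~ /($cjk)/g;

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@found), " CJK characters: @found\n";
```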
Now is the right time to bring in some more dynamic pictures, that is, to breathe some spirit into regular expressions. Check this:
#!/usr/bin/perl
use strict;
use utf8;
use encoding 'utf8';
use re 'eval';
our ( $re, $index, $num, $line, $key1,
$key2, @estimate, @over, @under, %got, );
$re = qr{(
[\p{InCJKUnifiedIdeographs}] |
[\p{InCJKCompatibility}] |
[\p{InCJKCompatibilityForms}] |
[\p{InCJKRadicalsSupplement}] |
[\p{InCJKCompatibilityIdeographsSupplement}] |
[\p{InCJKUnifiedIdeographsExtensionA}] |
[\p{InCJKCompatibilityIdeographs}] |
[\p{InCJKUnifiedIdeographsExtensionB}])}ox;
@over = qw(估計過高 переоценивать overestimate);
@under = qw(估計過低 недооценивать underestimate);
@estimate = (@over, @under);
$index = $num = 0;
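# grep serves here simply as a loop over @estimate
grep{
$num++;
($line) = $_; # keep a copy of the element before s/// changes $_
# s///e puts the current $index into the element; its return value
# (the number of substitutions made) becomes the inner hash key
$got{$num}{s/$re/$index++/e} = $line if(/$re/);
} @estimate;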
foreach $key1 ( sort {$a <=> $b} keys %got )
{
print "\n\nelement number:\t", $key1, "\n";
foreach $key2 ( sort {$a <=> $b} keys %{ $got{ $key1 } } )
{
print "match number: \t", $key2, "\t",
$got{ $key1 }{ $key2 }, "\n";
}
}
exit(0);
The script contains some redundancy, but that is for teaching purposes only and causes no performance hit: just benchmark something similar written with the all-too-common if/else branching and measure the time. What is special about these forty-odd lines? Two things deserve your attention, dear reader:
(1) the use of 'our': as you have probably noticed, no "parameters" are passed or returned anywhere. The hash '%got' exists from the very beginning, and every part of the script uses it in turn;
(2) there is a conditional increment of '$index', which numbers the matched elements so that one can get back to the source array if needed.
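If all that is wanted is the way back to the source array, the subscripts themselves can be collected. The following is a sketch of that alternative, not the script's own method: it greps over the indices instead of the elements, leaves the array untouched, and uses a single CJK block for brevity. The names `@matched` and `%by_index` are made up for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my $cjk = qr/\p{InCJKUnifiedIdeographs}/;

my @estimate = qw(估計過高 переоценивать overestimate
                  估計過低 недооценивать underestimate);

# grep over the subscripts, so each hit remembers where it came from.
my @matched = grep { $estimate[$_] =~ $cjk } 0 .. $#estimate;

# Map every matching subscript back to its element.
my %by_index = map { $_ => $estimate[$_] } @matched;

binmode STDOUT, ':encoding(UTF-8)';
print "subscript $_: $by_index{$_}\n" for sort { $a <=> $b } keys %by_index;
```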
Another way of thinking the regexp way is to use embedded code. It looks very much like the approach described above; however, it is not only a different flavour of regular expression usage but also another way of building conditions during matching:
#!/usr/bin/perl
use strict;
use utf8;
use encoding 'utf8';
use re 'eval';
our ( $re, $index, $num, $sample1,
$sample2, $sample3, %src, %got, );
$re = qr{(
[\p{InCJKUnifiedIdeographs}] |
[\p{InCJKCompatibility}] |
[\p{InCJKCompatibilityForms}] |
[\p{InCJKRadicalsSupplement}] |
[\p{InCJKCompatibilityIdeographsSupplement}] |
[\p{InCJKUnifiedIdeographsExtensionA}] |
[\p{InCJKCompatibilityIdeographs}] |
[\p{InCJKUnifiedIdeographsExtensionB}])}ox;
$sample1 = q[一萬一千一百一十一 11111];
$sample2 = q[天一地二天三地四
небу соответствует число один,
земле — два, небу— три, земле — четыре... (и т. д.; «Ицзин»)
one belongs to the Heaven,
two belongs to the Earth,
three belongs to the Heaven,
four belongs to the Earth... (etc, "YiJing")];
$sample3 =q[道生一, 一生二, 二生三, 三生萬物
Дао рождает одно (нерасчленённое единство),
одно рождает два (раздвоенность),
два рождает три (триаду),
от трёх рождаются все существа (вещи)
Dao bears one (indivisible unity),
one gives birth to the two (duality),
two produces three (triad),
those three flood everything (all things in the world)];
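# load the samples into %src, keyed 0, 1, 2
grep{ $src{$index++} = $_} ($sample1, $sample2, $sample3);
# split every sample into characters and count each CJK one in %got
grep{
grep{
# the code inside (??{ ... }) runs during the match
m#($re) ? (??{$got{$_}++ if($1 ne '')})#cgx;
} split //, $src{$_};
} sort keys %src;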
foreach $num ( keys %got )
{
print 'the match: ', "\t", $num, ' was seen:', "\t",
$got{ $num }, " time(s)\n";
}
exit(0);
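For comparison, here is a smaller sketch of the same counting idea written with the '(?{ ... })' form of embedded code, which runs its block purely for the side effect. It assumes a single CJK block and one short sample; `%seen` and `$text` are names invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use re 'eval';

our %seen;    # hypothetical tally: character => number of occurrences

my $cjk  = qr/\p{InCJKUnifiedIdeographs}/;
my $text = '道生一, 一生二, 二生三, 三生萬物';

# Each successful match runs the embedded block once and bumps the tally;
# /g in scalar context moves on to the next character on every pass.
1 while $text =~ /($cjk)(?{ $seen{$1}++ })/g;

binmode STDOUT, ':encoding(UTF-8)';
print "$_ was seen $seen{$_} time(s)\n" for sort keys %seen;
```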
Building '%src' is artificial here, but in real life hashes can do a lot of useful work, and the height of that usefulness is the transparent use of one hash throughout the whole script. Another interesting point is the way 'split' treats strings: it really does split them with Unicode in mind, character by character. One can find more functions like this, 'chop' and 'chomp' for example; but beware: not everything is smooth in this respect. Just try playing with 'use bytes' / 'no bytes' and 'length'.
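A small sketch of the character-versus-byte difference follows. Instead of 'use bytes' it encodes the string explicitly with the core Encode module, which is an assumption of this example rather than something the scripts above do; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

my $text = '道生一';

# split // and length both work on characters for a decoded string.
my @chars      = split //, $text;
my $char_count = length $text;

# After explicit encoding, length counts the UTF-8 octets instead.
my $octets     = encode('UTF-8', $text);
my $byte_count = length $octets;

print "characters: $char_count, split pieces: ", scalar(@chars),
      ", bytes: $byte_count\n";
```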