Leaked source code of windows server 2003
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

150 lines
4.3 KiB

  1. package Text::Soundex;
  2. require 5.000;
  3. require Exporter;
  4. @ISA = qw(Exporter);
  5. @EXPORT = qw(&soundex $soundex_nocode);
  6. $VERSION = '1.0';
  7. # $Id: soundex.pl,v 1.2 1994/03/24 00:30:27 mike Exp $
  8. #
  9. # Implementation of soundex algorithm as described by Knuth in volume
  10. # 3 of The Art of Computer Programming, with ideas stolen from Ian
  11. # Phillips <[email protected]>.
  12. #
  13. # Mike Stok <[email protected]>, 2 March 1994.
  14. #
  15. # Knuth's test cases are:
  16. #
  17. # Euler, Ellery -> E460
  18. # Gauss, Ghosh -> G200
  19. # Hilbert, Heilbronn -> H416
  20. # Knuth, Kant -> K530
  21. # Lloyd, Ladd -> L300
  22. # Lukasiewicz, Lissajous -> L222
  23. #
  24. # $Log: soundex.pl,v $
  25. # Revision 1.2 1994/03/24 00:30:27 mike
  26. # Subtle bug (any excuse :-) spotted by Rich Pinder <[email protected]>
  27. # in the way I handles leasing characters which were different but had
  28. # the same soundex code. This showed up comparing it with Oracle's
  29. # soundex output.
  30. #
  31. # Revision 1.1 1994/03/02 13:01:30 mike
  32. # Initial revision
  33. #
  34. #
  35. ##############################################################################
  36. # $soundex_nocode is used to indicate a string doesn't have a soundex
  37. # code, I like undef other people may want to set it to 'Z000'.
  38. $soundex_nocode = undef;
  39. sub soundex
  40. {
  41. local (@s, $f, $fc, $_) = @_;
  42. push @s, '' unless @s; # handle no args as a single empty string
  43. foreach (@s)
  44. {
  45. $_ = uc $_;
  46. tr/A-Z//cd;
  47. if ($_ eq '')
  48. {
  49. $_ = $soundex_nocode;
  50. }
  51. else
  52. {
  53. ($f) = /^(.)/;
  54. tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
  55. ($fc) = /^(.)/;
  56. s/^$fc+//;
  57. tr///cs;
  58. tr/0//d;
  59. $_ = $f . $_ . '000';
  60. s/^(.{4}).*/$1/;
  61. }
  62. }
  63. wantarray ? @s : shift @s;
  64. }
  65. 1;
  66. __END__
  67. =head1 NAME
  68. Text::Soundex - Implementation of the Soundex Algorithm as Described by Knuth
  69. =head1 SYNOPSIS
  70. use Text::Soundex;
  71. $code = soundex $string; # get soundex code for a string
  72. @codes = soundex @list; # get list of codes for list of strings
  73. # set value to be returned for strings without soundex code
  74. $soundex_nocode = 'Z000';
  75. =head1 DESCRIPTION
  76. This module implements the soundex algorithm as described by Donald Knuth
  77. in Volume 3 of B<The Art of Computer Programming>. The algorithm is
  78. intended to hash words (in particular surnames) into a small space using a
  79. simple model which approximates the sound of the word when spoken by an English
  80. speaker. Each word is reduced to a four character string, the first
  81. character being an upper case letter and the remaining three being digits.
  82. If there is no soundex code representation for a string then the value of
  83. C<$soundex_nocode> is returned. This is initially set to C<undef>, but
  84. many people seem to prefer an I<unlikely> value like C<Z000>
  85. (how unlikely this is depends on the data set being dealt with.) Any value
  86. can be assigned to C<$soundex_nocode>.
  87. In scalar context C<soundex> returns the soundex code of its first
  88. argument, and in list context a list is returned in which each element is the
  89. soundex code for the corresponding argument passed to C<soundex> e.g.
  90. @codes = soundex qw(Mike Stok);
  91. leaves C<@codes> containing C<('M200', 'S320')>.
  92. =head1 EXAMPLES
  93. Knuth's examples of various names and the soundex codes they map to
  94. are listed below:
  95. Euler, Ellery -> E460
  96. Gauss, Ghosh -> G200
  97. Hilbert, Heilbronn -> H416
  98. Knuth, Kant -> K530
  99. Lloyd, Ladd -> L300
  100. Lukasiewicz, Lissajous -> L222
  101. so:
  102. $code = soundex 'Knuth'; # $code contains 'K530'
  103. @list = soundex qw(Lloyd Gauss); # @list contains 'L300', 'G200'
  104. =head1 LIMITATIONS
  105. As the soundex algorithm was originally used a B<long> time ago in the US
  106. it considers only the English alphabet and pronunciation.
  107. As it is mapping a large space (arbitrary length strings) onto a small
  108. space (single letter plus 3 digits) no inference can be made about the
  109. similarity of two strings which end up with the same soundex code. For
  110. example, both C<Hilbert> and C<Heilbronn> end up with a soundex code
  111. of C<H416>.
  112. =head1 AUTHOR
  113. This code was implemented by Mike Stok (C<[email protected]>) from the
  114. description given by Knuth. Ian Phillips (C<[email protected]>) and Rich Pinder
  115. (C<[email protected]>) supplied ideas and spotted mistakes.