1. A method implemented by a stylus-based personal computer for recognizing handwriting of a particular user, the method comprising:
storing a set of handwriting samples of a particular user;
receiving a handwritten input from the particular user for recognition;
separately providing the handwritten input to first and second recognition engines, respectively, each of the first and second recognition engines applying a separate process for recognizing the handwritten input;
producing by the recognition process of the first recognition engine a first list of alternative classifications of the handwritten input by matching features of the handwritten input to features of the stored handwriting samples of the particular user, each of the classifications of the first list comprising a potential character with an associated recognition probability;
producing by the recognition process of the second recognition engine a second list of alternative classifications of the handwritten input without utilizing the stored handwriting samples, the process of the second recognition engine being designed to generically recognize handwriting for a plurality of users, each of the classifications of the second list comprising a potential character with an associated recognition probability;
applying data of the handwritten input, the first list, and the second list to a comparative neural network for processing, the applied data including a context feature of the handwritten input, a descriptive feature of a potential character and an associated recognition probability from the first list, and a descriptive feature of a potential character and an associated probability from the second list;
outputting a character as a recognition result of the handwritten input based on the processing of the neural network, the outputted character being one of the potential characters in the first and second lists; and
adapting the stored set of handwriting samples by adding a previously received handwritten input from the particular user to the set before a subsequently received handwritten input from the particular user is provided to the first and second recognition engines,
wherein the neural network is configured to merge the first and second lists by coalescing common classifications of the handwritten input in the first and second lists,
wherein the neural network is based on a computational model comprising a group of interconnected processing units, and
wherein the recognition process of the first recognition engine matches context features of the subsequently received handwritten input to the context features of the adapted set of handwriting samples to produce the first list for the subsequently received handwritten input.
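The adapting step and the user-specific matching of claim 1 can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the class `UserSampleStore`, the use of plain numeric feature tuples, the inverse-distance similarity, and the idea that an added sample carries the character it was recognized as. The claim itself does not specify any of these details.

```python
import math

class UserSampleStore:
    """Hypothetical store of one user's handwriting samples."""

    def __init__(self):
        self.samples = []  # list of (feature_tuple, character)

    def add(self, features, character):
        # Adaptation: a previously received input joins the stored set.
        self.samples.append((tuple(features), character))

    def classify(self, features):
        """Return (character, score) pairs, best first; scores sum to 1."""
        if not self.samples:
            return []
        best = {}
        for stored, character in self.samples:
            # Assumed similarity: closer stored samples score higher.
            similarity = 1.0 / (1.0 + math.dist(stored, tuple(features)))
            best[character] = max(best.get(character, 0.0), similarity)
        total = sum(best.values())
        ranked = [(c, s / total) for c, s in best.items()]
        return sorted(ranked, key=lambda item: item[1], reverse=True)

store = UserSampleStore()
store.add((0.0, 1.0), "a")
store.add((1.0, 0.0), "b")

first_input = (0.9, 0.1)
first_list = store.classify(first_input)

# The earlier input is added before the next input is classified.
store.add(first_input, first_list[0][0])
next_list = store.classify((0.8, 0.2))
```

In this sketch the second call to `classify` already sees the adapted set, which mirrors the ordering the claim requires between adaptation and the subsequently received input.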
2. The method recited in claim 1, wherein the first and second lists are generated by respective handwriting recognition engines.
3. The method recited in claim 1, wherein
the neural network merges the first and second lists into a combined list of alternative classifications, each classification of the combined list comprising a potential character from at least one of the first and second lists, the neural network producing an associated recognition probability for each potential character in the combined list, and
the outputting step outputs the potential character in the combined list having a highest associated recognition probability as the recognition result.
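The merging and outputting steps of claim 3 can be sketched as follows. This is only an illustration of coalescing common classifications and picking the highest-probability entry: the claims assign the scoring to a neural network, whereas this sketch substitutes a plain average of the two engines' probabilities. The function name `coalesce` and the sample lists are hypothetical.

```python
from collections import defaultdict

def coalesce(first, second):
    """Merge two candidate lists; a character found in both becomes one entry."""
    scores = defaultdict(list)
    for char, prob in first + second:
        scores[char].append(prob)
    # Stand-in for the neural network: average over the two engines,
    # treating a character missing from one list as probability 0.
    combined = [(char, sum(probs) / 2.0) for char, probs in scores.items()]
    combined.sort(key=lambda item: item[1], reverse=True)
    return combined

first_list = [("a", 0.6), ("o", 0.3)]
second_list = [("o", 0.5), ("a", 0.4), ("u", 0.1)]

combined = coalesce(first_list, second_list)
result = combined[0][0]  # character with the highest combined score
```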
4. The method recited in claim 1, further comprising merging the first and second lists of alternative classifications, each classification in the combined list comprising a potential character from at least one of the first and second lists, such that the potential characters are ordered according to likelihood of representing the handwritten input.
5. The method recited in claim 1, wherein the neural network employs a mergesort process to compare potential characters from the first and second lists.
6. The method of claim 1, further comprising:
selecting the first recognition engine from a plurality of generic recognition engines based on a number of strokes in the handwritten input; and
using the selected generic recognition engine to generate the second list.
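Selection by stroke count, as recited in claim 6, might look like the sketch below. The claim does not say which engine handles which stroke counts, so the threshold of 3 and the two placeholder engines (`engine_a`, `engine_b`) are assumptions made only to keep the example runnable.

```python
def select_engine(stroke_count, few_stroke_engine, many_stroke_engine, threshold=3):
    """Pick a generic engine from the stroke count (threshold is hypothetical)."""
    if stroke_count <= threshold:
        return few_stroke_engine
    return many_stroke_engine

def engine_a(strokes):
    # Placeholder generic engine returning fixed candidates.
    return [("i", 0.7), ("l", 0.3)]

def engine_b(strokes):
    # Placeholder generic engine returning fixed candidates.
    return [("m", 0.8), ("w", 0.2)]

# A handwritten input modeled as a list of strokes, each a list of (x, y) points.
strokes = [[(0, 0), (0, 5)], [(0, 7), (0, 7)]]

engine = select_engine(len(strokes), engine_a, engine_b)
candidates = engine(strokes)
```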
7. A classification tool implemented in a computer, comprising:
a stylus-based input device for receiving a handwritten input from a particular user;
a storage device for storing a set of handwriting samples of the particular user;
a user-specific recognition engine configured to apply a first recognition process for producing a first list of alternative classifications of the handwritten input by matching features of the handwritten input to features of the stored handwriting samples of the particular user, each of the classifications of the first list comprising a potential character with an associated recognition probability;
a generic recognition engine configured to apply a second recognition process for producing a second list of alternative classifications of the handwritten input without utilizing the stored handwriting samples of the particular user, the generic recognition engine being designed to generically recognize handwriting for a plurality of users, each of the classifications of the second list comprising a potential character with an associated recognition probability;
a comparative network that receives and processes data including a context feature of the handwritten input, a descriptive feature of a potential character and an associated probability of the first list, and a descriptive feature of a potential character and an associated probability of the second list, wherein the classification tool is configured to output a recognition result of the handwritten input based on the processing of the neural network, the outputted character being one of the potential characters in the first and second lists; and
a trainer component that adapts the stored set of handwriting samples by adding a previously received handwritten input from the particular user to the stored set before a subsequent handwritten input is received from the particular user,
wherein the user-specific recognition engine and the generic recognition engine are configured such that the first and second recognition processes are applied separately to the handwritten input,
wherein the neural network is configured to merge the first and second lists by coalescing common classifications of the handwritten input in the first and second lists,
wherein the neural network is based on a computational model comprising a group of interconnected processing units, and
wherein the recognition process of the user-specific recognition engine is configured to match context features of the subsequently received handwritten input to the context features of the adapted set of handwriting samples in order to produce the first list for the subsequently received handwritten input.
8. The classification tool recited in claim 7, wherein the neural network is configured as a comparative neural network employing a mergesort process to compare potential characters from the first and second lists.
9. The classification tool of claim 7, further comprising:
a plurality of generic recognition engines; and
a recognition engine selection module for selecting one of the generic recognition engines to produce the first list based on a number of strokes in the handwritten input.
10. The classification tool recited in claim 7, wherein
the neural network is configured to merge the first and second lists into a combined list of alternative classifications, each classification of the combined list comprising a potential character from at least one of the first and second lists, the neural network producing an associated recognition probability for each potential character in the combined list, and
the classification tool outputs the potential character from the combined list that has a highest associated recognition probability as the recognition result.
11. A computer-implemented method for recognizing handwritten characters, comprising:
storing a set of handwriting samples of a particular user;
providing first and second generic recognition engines configured to generically recognize handwritten characters for a plurality of users, the first generic recognition engine employing a neural network-based technique to recognize the handwritten characters, the second generic recognition engine employing a nearest neighbor matching technique to recognize the handwritten characters, wherein neither the technique of the first generic recognition engine nor the technique of the second generic recognition engine employs the stored set of handwriting samples to recognize the handwritten characters;
providing a user-specific recognition engine configured to recognize a particular user’s handwritten characters by matching features of the handwritten characters to features of stored handwriting samples obtained from the particular user;
receiving a handwritten input from the particular user;
selecting between the first and second generic recognition engines based on a number of strokes in the handwritten input;
applying the handwritten input to the selected generic recognition engine to generate a first list of potential characters, the potential characters in the first list being associated with recognition probabilities;
applying the handwritten input to the user-specific recognition engine to generate a second list of potential characters, the potential characters in the second list being associated with recognition probabilities, wherein the handwritten input is applied separately to the selected generic recognition engine and the user-specific recognition engine;
using a comparative neural network to process data of the handwritten input, the first list, and the second list;
choosing one of the potential characters from the first and second lists as a recognition result for the handwritten input based on the processing of the neural network; and
adapting the stored set of handwriting samples by adding a previously received handwritten input from the particular user to the set before a subsequently received handwritten input from the particular user is provided to the first and second recognition engines,
wherein the neural network merges the first and second lists by coalescing common classifications of the handwritten input in the first and second lists,
wherein the neural network is based on a computational model comprising a group of interconnected processing units, and
wherein the recognition process of the first recognition engine matches context features of the subsequently received handwritten input to the context features of the adapted set of handwriting samples to produce the first list for the subsequently received handwritten input.
12. The method recited in claim 11, wherein
the neural network merges the first and second lists into a combined list of alternative classifications, each classification of the combined list comprising a potential character from at least one of the first and second lists, the neural network producing an associated recognition probability for each potential character in the combined list, and
the choosing step chooses the potential character from the combined list that has a highest associated recognition probability as the recognition result.
The claims below are in addition to those above.
All references to claim(s) that appear below refer to the numbering after this sentence.
1. A method comprising:
defining a multi-dimensional vector space;
reducing each of a plurality of electronic documents to a corresponding multi-dimensional vector based upon the defined multi-dimensional vector space;
calculating a distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors, each portion of the plurality of corresponding multi-dimensional vectors containing a plurality of corresponding multi-dimensional vectors; and
determining one or more classifications for one or more respective portions of the electronic documents based upon the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.
2. The method of claim 1 where the electronic documents have been initially assigned to one of a number of categories.
3. The method of claim 1 wherein the dimensions of the multi-dimensional vector space are defined by at least one feature.
4. The method of claim 3 wherein each of the at least one feature is selected based upon the differentiation ability of the feature.
5. The method of claim 3 wherein the at least one feature is based upon criteria selected from the group consisting of selected words, selected phrases, algorithms, phone numbers, and URLs.
6. The method of claim 5 where an algorithm returns a description of the structure and text of the electronic document.
7. The method of claim 6 where the algorithm extracts a pattern from the electronic document.
8. The method of claim 7 where the algorithm is a regular expression.
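A feature defined by a regular expression, as in claims 5 through 8, could be sketched like this. The two patterns (a US-style phone number and a simple URL matcher) and the function name `pattern_features` are illustrative assumptions; the claims do not specify any particular expression.

```python
import re

# Hypothetical patterns chosen only for illustration.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
URL = re.compile(r"https?://[^\s]+")

def pattern_features(text):
    """Count pattern matches so they can serve as vector dimensions."""
    return {
        "phone_numbers": len(PHONE.findall(text)),
        "urls": len(URL.findall(text)),
    }

features = pattern_features("Call 555-123-4567 or visit http://example.com today")
```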
9. The method of claim 3 wherein each of the at least one feature is weighted based upon a differentiation ability of the feature.
10. The method of claim 9 wherein the feature weighting is based upon a rarity of occurrence in the multi-dimensional vector space.
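One way to weight a feature by its rarity, in the spirit of claim 10, is a smoothed inverse document frequency. That formula is a swapped-in, commonly used technique and is not stated in the claims; the function name `rarity_weights` and the sample vectors are likewise hypothetical.

```python
import math

def rarity_weights(vectors):
    """Give higher weight to dimensions that are non-zero in fewer vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    weights = []
    for d in range(dims):
        docs_with = sum(1 for v in vectors if v[d] > 0)
        # Smoothed inverse document frequency (an assumed formula).
        weights.append(math.log((1 + n) / (1 + docs_with)) + 1.0)
    return weights

vectors = [[2, 0, 1], [1, 1, 0], [3, 0, 0]]
weights = rarity_weights(vectors)
```

Here the first dimension occurs in every vector and so receives the lowest weight, while the two dimensions that occur in only one vector each receive the same, higher weight.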
11. The method of claim 9 wherein the feature weighting is based upon an occurrence in a particular category and non-occurrence in at least one other category.
12. The method of claim 3 wherein the at least one feature is derived from a corpus of categorized electronic documents.
13. The method of claim 3 wherein the electronic document is reduced to a corresponding multi-dimensional vector based upon an occurrence and frequency of the at least one feature.
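Reducing a document to a vector based on the occurrence and frequency of features, as in claim 13, can be sketched with simple word counts. The feature list, the whitespace tokenization, and the function name `to_vector` are assumptions for illustration only.

```python
from collections import Counter

def to_vector(text, features):
    """Map a document to per-feature counts (0 means the feature is absent)."""
    counts = Counter(text.lower().split())
    return [counts[feature] for feature in features]

# Hypothetical feature words defining a three-dimensional space.
features = ["free", "meeting", "offer"]

vector = to_vector("Free offer free shipping", features)
```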
14. The method of claim 1 wherein the electronic document is an electronic communication.
15. The method of claim 14 wherein the electronic communication is an e-mail.
16. The method of claim 1 wherein the electronic document is an electronic publication.
17. The method of claim 16 wherein the electronic document is a world wide web page.
18. The method of claim 1 wherein the corresponding multi-dimensional vector indicates an occurrence and a frequency of one or more of the features in the defined vector space.
19. The method of claim 1 wherein determining one or more classifications for one or more respective portions of the electronic documents further comprises:
comparing the calculated distance between each corresponding multi-dimensional vector to a specified distance;
determining if the distance between two or more multi-dimensional vectors is within a specified distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the specified distance constitute a cluster; and
designating a classification for this cluster.
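The clustering steps of claim 19 can be sketched as grouping vectors whose pairwise distance falls within the specified distance. This sketch assumes Euclidean distance and transitive (single-link) grouping via union-find, neither of which the claim requires; `cluster` and the sample points are hypothetical.

```python
import math

def cluster(vectors, max_distance):
    """Group indices of vectors linked by distances within max_distance."""
    parent = list(range(len(vectors)))

    def find(i):
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if math.dist(vectors[i], vectors[j]) <= max_distance:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Only groups of two or more vectors count as clusters.
    return [members for members in groups.values() if len(members) >= 2]

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters = cluster(points, max_distance=1.5)
```

Each returned group could then be given a classification label; the isolated point at index 4 joins no cluster.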
20. The method of claim 19 further comprising:
designating the classification of a cluster based upon the features of the two or more multi-dimensional vectors that constitute the cluster.
21. The method of claim 1 wherein the distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors is calculated using a specific distance metric.
22. The method of claim 21 wherein the specific distance metric is a cosine similarity distance metric.
23. The method of claim 21 wherein the specific distance metric is a ratio of weighted feature frequencies for the features the two multi-dimensional vectors have in common and weighted feature frequencies for all the features for the two multi-dimensional vectors.
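The ratio described in claim 23 admits more than one reading. The sketch below takes one plausible interpretation: the weighted frequencies of features present in both vectors, divided by the weighted frequencies of all features in either vector. The function name `common_feature_ratio` and the unit weights are assumptions.

```python
def common_feature_ratio(a, b, weights):
    """Weighted mass of shared features over weighted mass of all features."""
    common = sum(w * (x + y) for x, y, w in zip(a, b, weights) if x > 0 and y > 0)
    total = sum(w * (x + y) for x, y, w in zip(a, b, weights))
    return common / total if total else 0.0

a = [2, 0, 1]
b = [1, 3, 0]
ratio = common_feature_ratio(a, b, weights=[1.0, 1.0, 1.0])
```

Under this reading, only the first dimension is shared, so the ratio is 3 out of a total of 7.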
24. The method of claim 21 wherein the specific distance metric is selected from the group of distance metrics consisting of a non-zero dimension proportionality distance metric, a Manhattan distance metric, a Euclidean distance metric, a cosine similarity distance metric, and combinations thereof.
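Three of the metrics named in claims 22 and 24 have standard definitions and can be sketched directly. The non-zero dimension proportionality metric is not defined in the claims and is therefore omitted. Cosine similarity is converted to a distance here as one minus the similarity, which is a common convention and an assumption about the intended usage.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

a = [3, 0]
b = [0, 4]
distances = (manhattan(a, b), euclidean(a, b), cosine_distance(a, b))
```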
25. The method of claim 19 wherein the specified distance is a distance range.
26. The method of claim 19 further comprising:
specifying a second distance;
comparing the calculated distance between each corresponding multi-dimensional vector to the second distance;
determining if the distance between two or more multi-dimensional vectors is within the second distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the second distance constitute an additional cluster; and
designating a classification to the additional cluster.
27. The method of claim 1 wherein a plurality of classifications has been determined, further comprising:
specifying a second distance;
examining the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space; and
determining one or more additional classifications for one or more respective portions of the electronic documents based upon the second distance and the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.
28. A machine-readable medium having stored thereon a set of instructions which when executed cause a system to perform a method comprising:
defining a multi-dimensional vector space;
reducing each of a plurality of electronic documents to a corresponding multi-dimensional vector based upon the defined multi-dimensional vector space;
calculating a distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors, each portion of the plurality of corresponding multi-dimensional vectors containing a plurality of corresponding multi-dimensional vectors; and
determining one or more classifications for one or more respective portions of the electronic documents based upon the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.
29. The machine-readable medium of claim 28 where the electronic documents have been initially assigned to one of a number of categories.
30. The machine-readable medium of claim 28 wherein the dimensions of the multi-dimensional vector space are defined by at least one feature.
31. The machine-readable medium of claim 30 wherein each of the at least one feature is selected based upon the differentiation ability of the feature.
32. The machine-readable medium of claim 30 wherein the at least one feature is based upon criteria selected from the group consisting of selected words, selected phrases, algorithms, phone numbers, and URLs.
33. The machine-readable medium of claim 32 where an algorithm returns a description of the structure and text of the electronic document.
34. The machine-readable medium of claim 33 where the algorithm extracts a pattern from the electronic document.
35. The machine-readable medium of claim 34 where the algorithm is a regular expression.
36. The machine-readable medium of claim 30 wherein each of the at least one feature is weighted based upon a differentiation ability of the feature.
37. The machine-readable medium of claim 36 wherein the feature weighting is based upon a rarity of occurrence in the multi-dimensional vector space.
38. The machine-readable medium of claim 36 wherein the feature weighting is based upon an occurrence in a particular category and non-occurrence in at least one other category.
39. The machine-readable medium of claim 30 wherein the at least one feature is derived from a corpus of categorized electronic documents.
40. The machine-readable medium of claim 30 wherein the electronic document is reduced to a corresponding multi-dimensional vector based upon an occurrence and frequency of the at least one feature.
41. The machine-readable medium of claim 28 wherein the electronic document is an electronic communication.
42. The machine-readable medium of claim 41 wherein the electronic communication is an e-mail.
43. The machine-readable medium of claim 28 wherein the electronic document is an electronic publication.
44. The machine-readable medium of claim 43 wherein the electronic document is a world wide web page.
45. The machine-readable medium of claim 28 wherein the corresponding multi-dimensional vector indicates an occurrence and a frequency of one or more of the features in the defined vector space.
46. The machine-readable medium of claim 28 wherein the method further comprises:
comparing the calculated distance between each corresponding multi-dimensional vector to a specified distance;
determining if the distance between two or more multi-dimensional vectors is within a specified distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the specified distance constitute a cluster; and
designating a classification for this cluster.
47. The machine-readable medium of claim 46 wherein the method further comprises:
designating the classification of a cluster based upon the features of the two or more multi-dimensional vectors that constitute the cluster.
48. The machine-readable medium of claim 28 wherein the distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors is calculated using a specific distance metric.
49. The machine-readable medium of claim 48 wherein the specific distance metric is a cosine similarity distance metric.
50. The machine-readable medium of claim 48 wherein the specific distance metric is a ratio of weighted feature frequencies for the features the two multi-dimensional vectors have in common and weighted feature frequencies for all the features for the two multi-dimensional vectors.
51. The machine-readable medium of claim 48 wherein the specific distance metric is selected from the group of distance metrics consisting of a non-zero dimension proportionality distance metric, a Manhattan distance metric, a Euclidean distance metric, a cosine similarity distance metric, and combinations thereof.
52. The machine-readable medium of claim 46 wherein the specified distance is a distance range.
53. The machine-readable medium of claim 46 wherein the method further comprises:
specifying a second distance;
comparing the calculated distance between each corresponding multi-dimensional vector to the second distance;
determining if the distance between two or more multi-dimensional vectors is within the second distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the second distance constitute an additional cluster; and
designating a classification to the additional cluster.
54. The machine-readable medium of claim 28 wherein the method further comprises, upon determination of a plurality of classifications:
specifying a second distance;
examining the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space; and
determining one or more additional classifications for one or more respective portions of the electronic documents based upon the second distance and the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.
55. A system comprising:
a processor;
a network interface coupled to the processor; and
a machine-readable medium having stored thereon a set of instructions which when executed cause the system to perform a method comprising:
reducing each of a plurality of electronic documents to a corresponding multi-dimensional vector based upon the defined multi-dimensional vector space;
calculating a distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors, each portion of the plurality of corresponding multi-dimensional vectors containing a plurality of corresponding multi-dimensional vectors; and
determining one or more classifications for one or more respective portions of the electronic documents based upon the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.
56. The system of claim 55 where the electronic documents have been initially assigned to one of a number of categories.
57. The system of claim 55 wherein the dimensions of the multi-dimensional vector space are defined by at least one feature.
58. The system of claim 57 wherein each of the at least one feature is selected based upon the differentiation ability of the feature.
59. The system of claim 57 wherein the at least one feature is based upon criteria selected from the group consisting of selected words, selected phrases, algorithms, phone numbers, and URLs.
60. The system of claim 59 where an algorithm returns a description of the structure and text of the electronic document.
61. The system of claim 60 where the algorithm extracts a pattern from the electronic document.
62. The system of claim 61 where the algorithm is a regular expression.
63. The system of claim 57 wherein each of the at least one feature is weighted based upon a differentiation ability of the feature.
64. The system of claim 63 wherein the feature weighting is based upon a rarity of occurrence in the multi-dimensional vector space.
65. The system of claim 63 wherein the feature weighting is based upon an occurrence in a particular category and non-occurrence in at least one other category.
66. The system of claim 57 wherein the at least one feature is derived from a corpus of categorized electronic documents.
67. The system of claim 57 wherein the electronic document is reduced to a corresponding multi-dimensional vector based upon an occurrence and frequency of the at least one feature.
68. The system of claim 55 wherein the electronic document is an electronic communication.
69. The system of claim 68 wherein the electronic communication is an e-mail.
70. The system of claim 55 wherein the electronic document is an electronic publication.
71. The system of claim 70 wherein the electronic document is a world wide web page.
72. The system of claim 55 wherein the corresponding multi-dimensional vector indicates an occurrence and a frequency of one or more of the features in the defined vector space.
73. The system of claim 55 wherein the method further comprises:
comparing the calculated distance between each corresponding multi-dimensional vector to a specified distance;
determining if the distance between two or more multi-dimensional vectors is within a specified distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the specified distance constitute a cluster; and
designating a classification for this cluster.
74. The system of claim 73 wherein the method further comprises:
designating the classification of a cluster based upon the features of the two or more multi-dimensional vectors that constitute the cluster.
75. The system of claim 55 wherein the distance between each corresponding multi-dimensional vector of one or more portions of the plurality of corresponding multi-dimensional vectors is calculated using a specific distance metric.
76. The system of claim 75 wherein the specific distance metric is a cosine similarity distance metric.
77. The system of claim 75 wherein the specific distance metric is a ratio of weighted feature frequencies for the features the two multi-dimensional vectors have in common and weighted feature frequencies for all the features for the two multi-dimensional vectors.
78. The system of claim 75 wherein the specific distance metric is selected from the group of distance metrics consisting of a non-zero dimension proportionality distance metric, a Manhattan distance metric, a Euclidean distance metric, a cosine similarity distance metric, and combinations thereof.
79. The system of claim 73 wherein the specified distance is a distance range.
80. The system of claim 73 wherein the method further comprises:
specifying a second distance;
comparing the calculated distance between each corresponding multi-dimensional vector to the second distance;
determining if the distance between two or more multi-dimensional vectors is within the second distance;
determining that two or more multi-dimensional vectors having a distance between them that is within the second distance constitute an additional cluster; and
designating a classification to the additional cluster.
81. The system of claim 55 wherein the method further comprises, upon determination of a plurality of classifications:
specifying a second distance;
examining the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space; and
determining one or more additional classifications for one or more respective portions of the electronic documents based upon the second distance and the classifications that result from the calculated distances, properties of the multi-dimensional vectors, and properties of the defined multi-dimensional vector space.