On Patent Text Clustering

Among the various analyses performed on patents, text mining is the area where specialized software helps most. Two of the most popular text mining techniques applied to patent data are:
– Text segmentation / Tokenization
– Text Clustering / Topic identification

Text segmentation is the process of analyzing patent text and identifying smaller, meaningful segments within it. These segments are also called tokens or keywords. Various analyses can be performed using these tokens; however, for large patent sets the number of unique tokens can be very high, which makes them unsuitable for certain types of analysis.
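
As a rough illustration of the idea, the sketch below tokenizes two made-up abstracts and counts the unique tokens. The stop-word list, the `tokenize` helper and the sample abstracts are all assumptions made for this example; real tools use far richer segmentation rules.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "is", "to", "with", "in"}

def tokenize(text):
    """Lowercase the text, extract words (keeping hyphenated compounds
    together) and drop stop-words."""
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# Made-up abstracts standing in for real patent text.
abstracts = [
    "A probe tip for scanning the surface of a substrate.",
    "Method for scanning a substrate with a probe.",
]

# Count how often each token occurs across the whole set.
token_counts = Counter(tok for text in abstracts for tok in tokenize(text))

print(len(token_counts), "unique tokens")
print(token_counts.most_common(3))
```

Even with two short abstracts the vocabulary grows quickly, which hints at why raw token lists become unwieldy for sets of thousands of patents.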

Text clustering helps identify important topics or concepts (clusters) in a set of documents. Clustering of key patent text fields (such as Title, Abstract and Claims) has been used in various patent analysis tools and can help bring out otherwise hidden insights within patents. Analyzing relationships between generated clusters, or between patent classifications and clusters, is a popular approach among researchers, especially those in a competitive intelligence role.

Classical algorithms used for clustering include K-Means and Naïve Bayes. Their output is a set of categories (single level or hierarchical), each of which contains a group of patents. The name of a category is usually derived from the text of the patents themselves and can also be a combination of multiple terms. More advanced algorithms can usually recognize parts of speech and merge names accordingly. For instance, “tip of the probe…”, “using a probe tip…”, “with a probe whose tip is used for…”, and “probing the substrate with the tip…” will be merged under “probe tip”.
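
The merging of phrase variants under one label can be sketched as follows. This is only a crude stand-in: it uses naive suffix stripping instead of real part-of-speech analysis, and the helper names (`crude_stem`, `stems`, `matches_label`) are hypothetical, not taken from any particular tool.

```python
import re

def crude_stem(word):
    """Strip a few common suffixes so that 'probe' and 'probing' collapse
    to the same stem. A real engine would use proper linguistic analysis."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems(phrase):
    """Set of crude stems for every word in the phrase."""
    return {crude_stem(w) for w in re.findall(r"[a-z]+", phrase.lower())}

def matches_label(label, phrase):
    """A phrase falls under a label if it contains every stem of the label,
    regardless of word order or the words in between."""
    return stems(label) <= stems(phrase)

phrases = [
    "tip of the probe",
    "using a probe tip",
    "with a probe whose tip is used for",
    "probing the substrate with the tip",
    "holder for the probe",
]

# Collect the phrases that would be merged under "probe tip".
merged = [p for p in phrases if matches_label("probe tip", p)]
print(merged)
```

Here the first four phrases are merged under “probe tip”, while “holder for the probe” is left out because it never mentions a tip.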

Algorithms also differ in their ability to cluster a patent under a single topic or under multiple topics. For patents it is desirable to have algorithms that can place a patent under more than one topic, since an attempt to make a best match for the patent under a single topic may lead to errors in interpretation.
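
The difference between single-topic and multi-topic assignment can be sketched with a simple overlap score. The topics, the scoring rule and the 0.5 threshold below are assumptions chosen for illustration; actual clustering engines use their own similarity measures.

```python
def topic_scores(tokens, topics):
    """Fraction of each topic's terms that appear in the patent's tokens."""
    token_set = set(tokens)
    return {
        name: len(terms & token_set) / len(terms)
        for name, terms in topics.items()
    }

def assign_single(scores):
    """Best-match assignment: keep only the highest-scoring topic."""
    return max(scores, key=scores.get)

def assign_multi(scores, threshold=0.5):
    """Multi-topic assignment: keep every topic at or above the threshold."""
    return sorted(name for name, score in scores.items() if score >= threshold)

# Hypothetical topics, each described by a small set of terms.
topics = {
    "probe tip": {"probe", "tip"},
    "substrate scanning": {"substrate", "scanning"},
    "signal processing": {"signal", "filter"},
}

patent_tokens = ["probe", "tip", "scanning", "surface"]
scores = topic_scores(patent_tokens, topics)

print(assign_single(scores))
print(assign_multi(scores))
```

With best-match assignment the patent appears only under “probe tip”, and its partial relevance to “substrate scanning” is lost; the multi-topic variant keeps both.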

One of the key challenges for clustering algorithms has been to make the overall process transparent to IP professionals, who usually find it hard to accept the results of “black-box” clustering engines. Newer clustering algorithms give the analyst control over most aspects of the clustering process, allowing them to fine-tune the clustering output and train the algorithm to deliver more relevant results. The analyst can also specify how deep to go and choose to highlight broader or finer (more specific) topics.

Another challenge has been to make it easy for the analyst to influence which terms are picked up, so as to avoid meaningless terms or to give more weight to a certain class of terms. Apart from simply being able to ignore some words (stop-words), it is important to be able to set advanced filters that govern the choice of labels used to represent a topic. For instance, such filters can decide whether to ignore or give more importance to words in all caps, words starting with or containing a number, words containing more than one hyphen (compound names), or words that are simply too long or too short. Even a minor change to such filters can have a noticeable effect on the topics that are chosen.
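
A minimal sketch of such label filters is shown below. The `LabelFilter` class, its field names and its default values are all hypothetical, and the sketch only covers the "ignore" side of the filters, not the option of giving certain terms more importance.

```python
from dataclasses import dataclass

@dataclass
class LabelFilter:
    """Hypothetical set of analyst-adjustable rules for candidate labels."""
    stop_words: frozenset = frozenset({"said", "wherein", "thereof"})
    drop_all_caps: bool = True      # e.g. acronyms such as "DNA"
    drop_with_digits: bool = True   # e.g. "3D", "2021"
    max_hyphens: int = 1            # more than this counts as a compound name
    min_length: int = 3
    max_length: int = 20

    def keep(self, term):
        """Return True if the term may be used as (part of) a topic label."""
        if term.lower() in self.stop_words:
            return False
        if self.drop_all_caps and term.isupper():
            return False
        if self.drop_with_digits and any(ch.isdigit() for ch in term):
            return False
        if term.count("-") > self.max_hyphens:
            return False
        return self.min_length <= len(term) <= self.max_length

candidates = ["probe", "DNA", "wherein", "3D", "state-of-the-art",
              "multi-layer", "of", "substrate"]

default_labels = [t for t in candidates if LabelFilter().keep(t)]
relaxed_labels = [t for t in candidates if LabelFilter(drop_all_caps=False).keep(t)]

print(default_labels)
print(relaxed_labels)
```

Flipping a single switch (`drop_all_caps`) is enough to let “DNA” through as a label, which illustrates how a small filter change can alter the resulting topics.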

To see patent text clustering in action on patent data and understand its benefits via a sample use case scenario please read the full article here.