Using Thesaurus for Categorization

In Patent iNSIGHT Pro, the standard thesaurus option lets you maintain term taxonomies and Assignee groups, and you would mostly use it for cleanup activities. Recently, however, at a client group meet, I was asked whether it was possible to store complex search queries as synonyms for a technology term and manage them in thesaurus-style files.

Before describing how, I will quickly explain the need. Most users create multiple UDC (User Defined Category) sets as custom data points to map and analyze a patent set. Categories can be by Functionality, by Ingredient, by Product, by Technology-specific class, and so on. A user first creates the category schema and then, using Advanced Search, performs complex searches within the report and pushes the resulting patents into each category. So it seems logical to save each search along with its category name, in a fashion similar to a thesaurus. This helps not just with automation but also with reuse in a different report.

So let's get to how we can do this. An option introduced last year, Create UDC from Excel, allows you to prepare these category-to-search mappings and store them in Excel files. These files act like thesauruses that you can manage and extend over time.

For example, let’s take the Fuel Additives technology space. There was a report we did on this that is available here.

A portion of the categorization from the report is shown below:

Say we wanted to have a thesaurus for the Fuel Additive categories listed under “Types”. A sample Excel thesaurus for this is shown below:

Note: I have shown only some of the categories under Types.

In the Excel sheet, notice that search strings are treated as synonyms and separated by commas. By default, each string is searched against the full text, but you can restrict it to a particular section, as shown for Lubricants above. The Lubricants entry also shows that a synonym need not be a text term: it can be an IPC code (or a USPC/ECLA code) too.

The commas can be replaced with the OR operator to create a single search string without affecting the results. However, managing shorter strings as comma-separated synonyms is easier to interpret and more closely resembles a thesaurus.
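
The equivalence between a comma-separated synonym list and a single OR query can be sketched in a few lines of Python. This is only an illustration of the idea, not part of Patent iNSIGHT Pro: the function names `split_synonyms`, `is_wrapped`, and `to_or_query` are hypothetical, and the sketch assumes that commas inside parentheses belong to the search string rather than separating synonyms.

```python
def split_synonyms(cell: str) -> list[str]:
    """Split a synonym cell on commas that are not inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in cell:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current synonym.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    # Drop empty pieces, e.g. the one left by a trailing comma.
    return [p for p in parts if p]


def is_wrapped(s: str) -> bool:
    """True if the whole string is enclosed in one matching pair of parentheses."""
    if not (s.startswith("(") and s.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # The first "(" must close at the very last character.
                return i == len(s) - 1
    return False


def to_or_query(cell: str) -> str:
    """Join the synonyms into a single OR query (assumed equivalent, per the text)."""
    terms = []
    for s in split_synonyms(cell):
        # Parenthesize multi-word strings so OR does not change their meaning.
        if " " in s and not is_wrapped(s):
            s = f"({s})"
        terms.append(s)
    return " OR ".join(terms)


cell = "(Protease w/2 inhibitor*), Ribozyme,"
print(split_synonyms(cell))  # ['(Protease w/2 inhibitor*)', 'Ribozyme']
print(to_or_query(cell))     # (Protease w/2 inhibitor*) OR Ribozyme
```

Keeping the synonyms as a list and joining them only when needed preserves the thesaurus-like readability while still producing the single-string form.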

The Excel file can be passed to the Create UDC from Excel feature, and at the preview stage you can even drag and drop categories to create a hierarchy.

The synonyms (aka search strings) are searched in the text of patents and the results are pushed into corresponding categories.

So an Excel file like the one shown above works like a thesaurus that you can prepare and maintain for your technology categorization purposes. It is also more flexible than a typical thesaurus, since:

  • you can include non-text elements such as classification codes
  • you can use proximity operators between words
  • you can force some words to be searched only in the claims and others in the full text, within the same synonym group
  • you can exclude unwanted words by using the NOT operator

The last point is pretty useful. For example, a thesaurus entry for Antivirus (in the biochemical field) can be:

Antivirus (Antivirus not w/5 (computer or network or internet)),
(Anti-virus not w/5 (computer or network or internet)),
(Protease w/2 inhibitor*),
Ribozyme,
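
To make the effect of the NOT operator concrete, here is a minimal Python sketch. It assumes that `Antivirus not w/5 (computer or network or internet)` matches a document when the term occurs at least once with none of the listed words within five words of it. That reading is an assumption for illustration only, and the helper names `tokenize` and `matches_not_within` are hypothetical. It is not how Patent iNSIGHT Pro evaluates queries internally.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words, keeping hyphenated words whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def matches_not_within(text: str, term: str, excluded: set[str], distance: int = 5) -> bool:
    """True if `term` occurs at least once with no excluded word within `distance` words."""
    words = tokenize(text)
    term = term.lower()
    excluded = {w.lower() for w in excluded}
    for i, word in enumerate(words):
        if word != term:
            continue
        # Words up to `distance` positions before and after this occurrence.
        window = words[max(0, i - distance): i + distance + 1]
        if not excluded.intersection(window):
            return True
    return False


ignore = {"computer", "network", "internet"}
biochemical = "The antivirus compound inhibits viral replication in host cells."
software = "Install antivirus software on every computer in the network."

print(matches_not_within(biochemical, "antivirus", ignore))  # True
print(matches_not_within(software, "antivirus", ignore))     # False
```

The second sentence is rejected because "computer" falls within five words of "antivirus", which is the kind of off-topic record the ignore words are meant to keep out of the category.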

Doesn’t this seem like the obvious thing to do if you want to save and automate the process of categorizing records for use in other reports?