Thursday, February 26, 2009

Searching for Bibby Fisjer

I am ramping up to start our search project again. The search engine we are using is a third-party tool, so most of the development team's work is going to lie in coding the front end.

But there is much more involved than just coding to an API. There is the actual business implementation of the search as well as the migration from our old system to the new one.

I plan to document as much of this process as possible here, as this project should provide me, if not the reader, with hours of enjoyment. Here is problem #1:

Jargon


The following refers only to navigational searching, not full-text or keyword searching.
Suppose there exists a standardized list of industries which the majority of large businesses use to catalog customers. Let's call it an ISO Industry List. For argument's sake, let's say this list has 5 industries listed:
1 - Finance
2 - Tourism
3 - Healthcare
4 - Technology
5 - Manufacturing
These are common industries which are likely to exist into the foreseeable future. They are distinct enough that two will not merge and become a hybrid, even if aspects of one incorporate features of another.

Let us also assume a company exists which prefers to use its own jargon to label documents.* For whatever reason, whether valid or not, the ISO codes are not acceptable to this company, and so it starts creating its own list of industries. Let us assume the decision to do so was motivated by two main reasons:
1 - The ISO codes do not divide sectors appropriately for the business.
2 - They just don't like the names in the ISO List.

As a result, the company begins producing a list of its own based on what it perceives as industries pertinent to its business. Not having the resources to survey and create their own ISO-type list, people begin simply adding to the list whatever industries they think are valid. They create a first pass:
1 - Banking
2 - Tourism
3 - Healthcare
4 - Medical Devices
5 - Manufacturing

For now this appears to do the trick. Then the company gets a new client which does not fit into this list and a new category is created:
6 - Private Equity
And then another:
7 - Biotechnology
And another:
8 - Pharmaceuticals

But then someone thinks that a client actually belongs in two industries and creates:
9 - Biotech & Pharma.

Each of these categories has been attached to a document in our search engine. But before we get to the actual documents, let's look at how we have already created a mess.

Industries are not atomistic


In our list of industries, even before we get to #9, we have created a list of industries which are neither atomistic nor reside on the same level of any reasonable hierarchy. Ideally, a list like the one above should have each leaf of the tree on the same level. Biotechnology is far more specific than Manufacturing. Creating this disjunction leads to confusion when adding new items, since the person adding the new category is uncertain as to the level of specificity they should use.
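To make the "same level" point concrete, here is a minimal Python sketch that groups categories by their depth in a tree. The parent links in `PARENT` are hypothetical and invented purely for illustration; the company in this example never wrote such a tree down, which is part of the problem.

```python
# Hypothetical parent links, invented for illustration only.
# None marks a top-level industry.
PARENT = {
    "Finance": None,
    "Tourism": None,
    "Healthcare": None,
    "Manufacturing": None,
    "Banking": "Finance",
    "Medical Devices": "Manufacturing",
    "Biotechnology": "Healthcare",
}

def depth(industry):
    """Count how many ancestors an industry has in PARENT."""
    d = 0
    while PARENT[industry] is not None:
        industry = PARENT[industry]
        d += 1
    return d

def levels(categories):
    """Group categories by their depth in the (assumed) tree."""
    by_depth = {}
    for category in categories:
        by_depth.setdefault(depth(category), []).append(category)
    return by_depth

# A subset of the jargon list from the example above.
jargon_list = [
    "Banking", "Tourism", "Healthcare",
    "Medical Devices", "Manufacturing", "Biotechnology",
]
mixed = levels(jargon_list)

# More than one depth means the list mixes levels of specificity.
is_flat = len(mixed) == 1
```

Under these assumed parent links the list spans two depths, so `is_flat` is false: broad sectors and narrow specialties sit side by side in what is supposed to be a single flat list.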

Ever Changing Titles


This confusion leads to categories which reflect jargon more than concrete types. The list grows each time a new category is added. The more industries are added, the harder it becomes to determine which are valid and which are invalid. Adding industries for corner cases leads us to create industries for only one or maybe two clients.

The documents are tainted


All of the above would not be such an issue if this list were not actually tied to document searching. Creating a list based on such whims creates the following issues when trying to search on the documents:
1 - If the jargon industry is changed, the document must be changed and re-indexed. In a system of thousands of documents there is a large overhead in terms of management and time.
2 - If the jargon industry represents corner cases of clients, the likelihood that someone will search on the word is slim. Although the precision on such a search is high, the recall for the document is low. In addition, documents are classified at different levels of a tree. When recall is high, the specificity of the document as it relates to the search may not be clear.

Is tagging hierarchy the answer?


One might think the way to resolve the above issue is to tag each document with a hierarchy. But besides the overhead of creating that much more data to index, it does not solve the underlying problem of non-atomistic industries. A hierarchy can suffer from all of the pitfalls of a jargon-induced list.

A way out


One way to solve this issue is to use the ISO list to tag documents. In fact, that is the only way out, short of creating a new list formed in the same manner and with the same rigid standards as ISO. What we can then do is create a translation dictionary to translate jargon industries to ISO industries at search time. This allows us to maintain a certain sense of identity for the user while preserving the integrity of the documents. We can then create hierarchies, either using the ISO lists or by creating our own, without affecting the location of a document or having to re-index. We use these hierarchies to direct the search, but use the levels of the hierarchy to do the actual searching.
This solution addresses the two reasons given above for creating one's own list. First, we can divide the industries as the business sees fit by using our custom mapping dictionary, which also allows us to rename the categories. However, at the base level we retain our rigid atomistic separation. What we gain, however, is much more valuable. By using a standardized set of industries we are much less likely to end up with corner cases. An ISO-type list has already gone through rigid scrutiny to a level most businesses can't match.
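As a rough illustration of the translation dictionary, here is a minimal Python sketch. The ISO names are the five from the example list; the specific jargon-to-ISO mappings, the sample documents, and the function names `translate` and `search` are all assumptions made up for this sketch, not part of any real search engine API.

```python
# The five industries from the example ISO list.
ISO_INDUSTRIES = {"Finance", "Tourism", "Healthcare", "Technology", "Manufacturing"}

# Hypothetical jargon-to-ISO dictionary; the mappings are assumptions.
# A jargon term may map to more than one ISO industry.
JARGON_TO_ISO = {
    "Banking": {"Finance"},
    "Private Equity": {"Finance"},
    "Medical Devices": {"Healthcare", "Manufacturing"},
    "Biotechnology": {"Healthcare"},
    "Pharmaceuticals": {"Healthcare"},
    "Biotech & Pharma": {"Healthcare"},
}

# Hypothetical documents, tagged only with ISO industries.
DOCUMENTS = [
    {"id": 1, "iso_industry": "Finance"},
    {"id": 2, "iso_industry": "Healthcare"},
    {"id": 3, "iso_industry": "Tourism"},
]

def translate(term):
    """Translate a search term into a set of ISO industries."""
    if term in ISO_INDUSTRIES:
        return {term}
    # Unknown jargon translates to nothing rather than raising.
    return set(JARGON_TO_ISO.get(term, set()))

def search(term, documents):
    """Return ids of documents whose ISO tag matches the translated term."""
    wanted = translate(term)
    return [doc["id"] for doc in documents if doc["iso_industry"] in wanted]

banking_hits = search("Banking", DOCUMENTS)

# Renaming a jargon category is a dictionary edit only;
# no document is touched and nothing is re-indexed.
JARGON_TO_ISO["Financial Services"] = JARGON_TO_ISO.pop("Banking")
renamed_hits = search("Financial Services", DOCUMENTS)
```

In this sketch `banking_hits` and `renamed_hits` contain the same document, which is the point: the jargon can change as often as the business likes while the indexed tags stay put.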

But how can we map the categories?


All of the above sounds great in theory, but the technical part is not so clear. Here is what I propose. Using an ISO list of industries and their hierarchies, we tag each document with the highest level and the lowest level that document refers to. For example:
Document #1
Title: Manufacturing in a Post Modern Era
High Industry: Manufacturing
Low Industry: Manufacturing
This document deals with general aspects of manufacturing so we tag the highest level and the lowest level the same.

Document #2
Title: Computer Chip Processing
High Industry: Manufacturing
Low Industry: Computer Components
This document deals with a specific type of manufacturing, so we label it at this level.

With our mapping dictionary we can map "Chip Processing" to "Computer Components" if all the computer components we ever deal with are microchips. If the company changes direction, we can expand our dictionary or shrink it as necessary. The dictionary allows for many-to-many relationships and can be used to expand or shrink our search as well. Again, we must remember that this is for directed navigational searching and not keyword searching. Whereas keyword and full-text searching are clouded and nebulous, navigational searches should be uniform and atomistic.
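Here is a minimal Python sketch of the high/low tagging using the two example documents. The `Document` class, the `navigate` function, and the matching rule (a document matches when the translated term equals either its high or its low industry) are my assumptions for illustration; the third-party engine may well do this differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    doc_id: int
    high_industry: str  # highest ISO level the document refers to
    low_industry: str   # lowest ISO level the document refers to

# Document #1 and Document #2 from the example above.
DOCUMENTS = [
    Document(1, "Manufacturing", "Manufacturing"),
    Document(2, "Manufacturing", "Computer Components"),
]

# Jargon-to-ISO mapping dictionary; one entry, taken from the example.
MAPPING = {
    "Chip Processing": {"Computer Components"},
}

def navigate(term, documents, mapping):
    """Navigational search: translate jargon, then match high or low tags."""
    # Terms not in the dictionary are assumed to already be ISO names.
    iso_terms = mapping.get(term, {term})
    return [
        doc.doc_id
        for doc in documents
        if doc.high_industry in iso_terms or doc.low_industry in iso_terms
    ]

broad = navigate("Manufacturing", DOCUMENTS, MAPPING)     # both documents
narrow = navigate("Chip Processing", DOCUMENTS, MAPPING)  # Document #2 only
```

Expanding or shrinking the search is then a matter of adding or removing ISO industries in a `MAPPING` entry; the tags on the documents themselves never change.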


*A document is any piece of content we wish to add to our search system.
