Wednesday, March 11, 2009

And then there were 3

As we move forward implementing a new search engine, it is becoming increasingly clear that the organization of the data to be indexed is quite a mess. Although the documents themselves are fine, the way metadata is attached and ordered is scatterbrained, to say the least.

One of the areas which needs the most improvement is the classification of documents. Each document contains metadata for several hierarchies. To keep the discussion simple, I will work with one hierarchy: Industry. The industry hierarchy classifies documents according to which industry they belong to and at which level. The hierarchy can be up to 6 levels deep. So, a document can be tagged in any of the following ways:

Industry: Consumer Products
Industry: Consumer Products\Food
Industry: Consumer Products\Food\Beverages
Industry: Consumer Products\Food\Beverages\SoftDrinks

It is not mandatory that each document goes to the 6th level because articles can be more or less specific regarding an industry. The previous post regarding searching has already pointed out the problems with our current taxonomy. This post will expand on that one by referring to real current problems.
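
To make the tagging scheme concrete, here is a minimal sketch of how one of these tags could be split into its levels. The function names (`parse_tag`, `ancestors`) are made up for illustration, and the sketch assumes the backslash separator and the `Hierarchy: path` layout shown in the examples above; it is not how our current system does it.

```python
SEPARATOR = "\\"   # assumed level separator, as in the sample tags
MAX_DEPTH = 6      # maximum hierarchy depth mentioned in this post


def parse_tag(tag):
    """Split a tag such as 'Industry: A/B/C' (with backslashes) into
    the hierarchy name and the list of levels."""
    hierarchy, _, path = tag.partition(":")
    levels = [part.strip() for part in path.split(SEPARATOR) if part.strip()]
    if not 1 <= len(levels) <= MAX_DEPTH:
        raise ValueError(f"expected 1 to {MAX_DEPTH} levels, got {len(levels)}")
    return hierarchy.strip(), levels


def ancestors(levels):
    """Return every prefix path of the tag, most general first."""
    return [SEPARATOR.join(levels[:i]) for i in range(1, len(levels) + 1)]


hierarchy, levels = parse_tag(r"Industry: Consumer Products\Food\Beverages\SoftDrinks")
print(hierarchy, "depth", len(levels))
for path in ancestors(levels):
    print(path)
```

Run on the deepest sample tag, this prints the same four paths listed above, from Consumer Products down to SoftDrinks.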

When you look at the distribution of the number of terms at each level of the hierarchy it breaks down like this:




Hierarchy      1    2    3    4    5    6
Industry:     14   96  275  380  412  421
Capability:   11   62  164  215  215  215
Cases:        15   99  292  412  456  464

Each level's count includes the terms from the levels above it, so the numbers are cumulative term counts. In each hierarchy, term creation peaks at level three (or possibly four). Beyond that, the number of documents actually tagged at levels 4, 5, and 6 is very low: documents with that depth of tagging do not even reach a tenth of one percent of the total number of documents. It is highly unlikely that a user navigates down to that level of granularity. In addition, the UI for such granularity is a nightmare to navigate. Finally, this granularity creates a user base which tries to be as specific as possible, thus reducing recall (the number of hits returned from search).
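
Since the counts are cumulative, the number of terms introduced at each level is the difference between neighboring columns. The short sketch below does that subtraction. It assumes the figures are read as the six cumulative counts per hierarchy shown in the dictionary (for example 14, 96, 275, 380, 412, 421 for Industry); the variable and function names are only illustrative.

```python
# Cumulative term counts per level, as read from the table in this post.
cumulative = {
    "Industry":   [14, 96, 275, 380, 412, 421],
    "Capability": [11, 62, 164, 215, 215, 215],
    "Cases":      [15, 99, 292, 412, 456, 464],
}


def new_terms_per_level(counts):
    """Turn cumulative counts into the number of terms added at each level."""
    return [counts[0]] + [b - a for a, b in zip(counts, counts[1:])]


for name, counts in cumulative.items():
    added = new_terms_per_level(counts)
    peak_level = added.index(max(added)) + 1  # levels are 1-based
    print(f"{name}: {added} (peak at level {peak_level})")
```

Under that reading, all three hierarchies add the most new terms at level three, and Capability adds none at all at levels five and six.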

It is that last issue which our current system faces. To increase recall, expansive dictionaries have been created to expand the query. Of course, this has a negative effect on precision: while recall is now increased, the drop in precision means more sifting for the user. Add to this poor relevancy ranking, and search results become next to meaningless.
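
The trade-off can be shown with a toy example. Everything in the sketch below is hypothetical: the five-document corpus, the set of relevant documents, and the expansion dictionary are invented for illustration and have nothing to do with our real data. It also uses the textbook definitions (precision is the share of hits that are relevant, recall is the share of relevant documents that were hit), which is a little stricter than the loose "number of hits" usage above.

```python
# Hypothetical toy corpus: document id -> words in the document.
docs = {
    1: {"soda", "sales", "report"},
    2: {"soft", "drink", "market"},
    3: {"cola", "advertising"},
    4: {"drink", "driving", "law"},
    5: {"beverage", "packaging"},
}
# Hypothetical judgment: what a user looking for soft drinks actually wants.
relevant = {1, 2, 3}

# Hypothetical expansion dictionary of the kind described above.
synonyms = {"soda": {"cola", "drink", "beverage"}}


def search(terms):
    """Return ids of documents containing at least one query term."""
    return {doc_id for doc_id, words in docs.items() if words & terms}


def precision_recall(hits):
    """Textbook precision and recall against the 'relevant' set."""
    true_hits = hits & relevant
    precision = len(true_hits) / len(hits) if hits else 0.0
    recall = len(true_hits) / len(relevant)
    return precision, recall


plain = search({"soda"})
expanded = search({"soda"} | synonyms["soda"])

print("plain:   ", sorted(plain), precision_recall(plain))
print("expanded:", sorted(expanded), precision_recall(expanded))
```

In this toy setup the plain query finds one of the three relevant documents and nothing else. The expanded query finds all three, but two of its five hits are irrelevant, which is exactly the extra sifting described above.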

Perhaps it is best to limit hierarchy tagging to 3 levels. After 3 levels the hierarchy taxonomy serves no purpose.

If a document needs more information that is not contained in the document itself or in the metadata already gathered, it would be possible to add another field to the document, such as "Industry Tags". This could hold free-form modifiers and text supplied by authors and knowledge managers. Currently only the knowledge managers tag and classify the documents. This leaves a lot of room for interpretation in terms of specificity. And as we mentioned before, the taxonomy is out of control. A knowledge manager can add new terms if that person deems it necessary, and when dealing with very low levels of granularity, that situation is likely to arise.

Instead, an Industry Tag field would be used (mostly by authors and users) to help define the document. Users working in the same area with the same businesses are more likely to define documents in a similar manner. Authors are better able to summarize or keyword-ize their own works, at least better than a disinterested third-party Knowledge Manager.

Stopping at 3 or 4 levels creates a better search experience, since the navigation now contains fewer terms and more documents per term. The industry tags, now set by the actual consumers of the text, can then narrow down the results within that more general hierarchy.
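
As a rough sketch of how the two pieces could fit together: truncate every hierarchy path to the navigation depth, group documents under the truncated term, then narrow within a term using the free-form tags. The field names (`industry`, `industry_tags`), the sample documents, and the cutoff of 3 are all assumptions for illustration, not an existing schema.

```python
from collections import defaultdict

MAX_NAV_DEPTH = 3  # assumed cutoff; the post suggests 3 or 4


def truncate(path, depth=MAX_NAV_DEPTH, sep="\\"):
    """Keep only the first 'depth' levels of a hierarchy path."""
    return sep.join(path.split(sep)[:depth])


# Hypothetical documents with a hierarchy path and free-form industry tags.
documents = [
    {"id": 1, "industry": r"Consumer Products\Food\Beverages\SoftDrinks",
     "industry_tags": {"cola", "bottling"}},
    {"id": 2, "industry": r"Consumer Products\Food\Beverages",
     "industry_tags": {"juice"}},
    {"id": 3, "industry": r"Consumer Products\Food",
     "industry_tags": {"cola", "snacks"}},
]


def build_navigation(docs):
    """Group documents under their truncated hierarchy term."""
    nav = defaultdict(list)
    for doc in docs:
        nav[truncate(doc["industry"])].append(doc)
    return nav


def narrow(docs, tag):
    """Within one navigation term, keep ids of documents carrying the tag."""
    return [doc["id"] for doc in docs if tag in doc["industry_tags"]]


nav = build_navigation(documents)
beverages = nav[r"Consumer Products\Food\Beverages"]
print([doc["id"] for doc in beverages])
print(narrow(beverages, "cola"))
```

Here the fourth-level SoftDrinks document lands under the same Beverages term as its level-three neighbor, and the "cola" tag then picks it out again without needing a fourth navigation level.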
