One of the areas that needs the most improvement is the classification of documents. Each document carries metadata for several hierarchies. To keep the discussion simple, I will work with one hierarchy: Industry. The Industry hierarchy classifies documents according to which industry they belong to and at which level. The hierarchy can be up to 6 levels deep, so a document can be tagged in any of the following ways:
Industry: Consumer Products
Industry: Consumer Products\Food
Industry: Consumer Products\Food\Beverages
Industry: Consumer Products\Food\Beverages\SoftDrinks
A document is not required to go all the way down to the 6th level, because articles can be more or less specific about an industry. The previous post on searching already pointed out the problems with our current taxonomy. This post expands on that one by referring to real, current problems.
When you look at the distribution of the number of terms at each level of the hierarchy it breaks down like this:
| Hierarchy | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 |
| Industry | 14 | 96 | 275 | 380 | 412 | 421 |
| Capability | 11 | 62 | 164 | 215 | 215 | 215 |
| Cases | 15 | 99 | 292 | 412 | 456 | 464 |
It is that last issue which our current system faces. To increase recall, expansive dictionaries have been created that expand the query with additional terms. Of course, this has a negative effect on precision. While recall is now higher, the drop in precision means more sifting for the user. Add poor relevancy ranking to this and the search results become next to meaningless.
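The recall and precision trade-off described above can be illustrated with a toy example. This is only a sketch: the five documents, the relevance judgments, and the expansion terms are made up for illustration and do not come from our system or its dictionaries.

```python
# Hypothetical mini-corpus; doc ids and texts are invented for illustration.
docs = {
    1: "soft drink market share report",
    2: "soda bottling plant operations",
    3: "beverage distribution strategy",
    4: "drink driving legislation overview",
    5: "pop music industry trends",
}
# Assumed ground truth: the documents a user interested in soft drinks wants.
relevant = {1, 2, 3}

def search(terms):
    """Return ids of documents containing at least one of the query terms."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(term in text.split() for term in terms)
    }

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The original query, then the same query expanded by a (hypothetical) dictionary.
narrow_hits = search({"soda"})
expanded_hits = search({"soda", "drink", "pop", "beverage"})

print(precision_recall(narrow_hits, relevant))    # high precision, low recall
print(precision_recall(expanded_hits, relevant))  # full recall, lower precision
```

In this toy corpus the expanded query finds every relevant document, but it also pulls in the drink-driving and pop-music documents, which is the extra sifting the user ends up doing.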
Perhaps it is best to limit hierarchy tagging to 3 levels. Beyond 3 levels the hierarchy taxonomy serves no purpose.
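As a rough sketch of what capping the hierarchy at 3 levels would mean for existing tags, the snippet below truncates backslash-separated paths like the ones shown earlier. The function name `truncate_path` and the default depth are my own choices for illustration, not part of any existing tool.

```python
def truncate_path(path: str, max_depth: int = 3, sep: str = "\\") -> str:
    """Keep only the first max_depth levels of a hierarchy path."""
    parts = [part for part in path.split(sep) if part]
    return sep.join(parts[:max_depth])

# The example Industry tags from this post.
tags = [
    "Consumer Products",
    "Consumer Products\\Food",
    "Consumer Products\\Food\\Beverages",
    "Consumer Products\\Food\\Beverages\\SoftDrinks",
]

truncated = [truncate_path(tag) for tag in tags]
# Deeper tags collapse into their level-3 ancestor, so fewer distinct terms remain.
distinct_terms = sorted(set(truncated))

for term in distinct_terms:
    print(term)
```

Shallower tags pass through unchanged, and anything deeper than level 3 is folded into its level-3 ancestor.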
If a document needs more information than is contained in the document itself or in the metadata already gathered, it would be possible to add another field to the document, such as "Industry Tags". This could hold free-form modifiers and text supplied by authors and knowledge managers. Currently only the knowledge managers tag and classify the documents. This leaves a lot of room for interpretation in terms of specificity. And as we mentioned before, the taxonomy is out of control. A knowledge manager can add new terms if that person deems it necessary, and when dealing with very low levels of granularity, that situation is likely to arise.
Instead, an Industry Tags field would be used (mostly by authors and users) to help define the document. Users working in the same area with the same businesses are more likely to define documents in a similar manner. Authors are better able to summarize or keyword-ize their own works, at least better than a disinterested third-party knowledge manager.
Stopping at 3 or 4 levels creates a better search experience, since the navigation then contains fewer terms and more documents per term. The industry tags, now set by the actual consumers of the text, can then narrow down the results within that more general hierarchy.
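A minimal sketch of that two-step narrowing follows: first navigate to a hierarchy term, then filter by free-form tags. The `Document` class, its field names, and the sample documents are all hypothetical, made up here to show the idea rather than to describe our actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str
    industry: str                                     # hierarchy path, capped at 3 or 4 levels
    industry_tags: set = field(default_factory=set)   # free-form tags from authors and users

def is_under(doc, term, sep="\\"):
    """True if the document sits at the hierarchy term or anywhere below it."""
    return doc.industry == term or doc.industry.startswith(term + sep)

def narrow(docs, term, tags):
    """Documents under the hierarchy term that carry all of the requested tags."""
    wanted = {tag.lower() for tag in tags}
    return [
        doc
        for doc in docs
        if is_under(doc, term)
        and wanted <= {tag.lower() for tag in doc.industry_tags}
    ]

# Invented sample data.
docs = [
    Document("Cola pricing study", "Consumer Products\\Food\\Beverages", {"soft drinks", "pricing"}),
    Document("Bottled water outlook", "Consumer Products\\Food\\Beverages", {"water"}),
    Document("Snack trends", "Consumer Products\\Food", {"pricing"}),
]

beverage_pricing = narrow(docs, "Consumer Products\\Food\\Beverages", {"pricing"})
food_pricing = narrow(docs, "Consumer Products\\Food", {"pricing"})

print([doc.title for doc in beverage_pricing])
print([doc.title for doc in food_pricing])
```

The hierarchy term does the coarse navigation and the tags do the fine-grained filtering, so the taxonomy itself never needs a term as specific as the tag.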