At what level does tagging a document in a hierarchy become meaningless? A breakdown of our term usage ranges from "United States", used 585 times (2.178 percent of our total documents), to "Fund Strategy - SWF", used once (0.004 percent). Now, this is all the terms against all the documents. This list of 1639 terms is usually broken into several hierarchies. Within a hierarchy, a single term can classify anywhere from 20 percent of the documents to less than a hundredth of a percent.
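The percentages quoted here are plain ratios of term uses to total documents. As a rough check, here is a minimal Python sketch. The function name `usage_share` is made up for illustration, and the document total is not stated in this post, so it is inferred from the "United States" figure.

```python
def usage_share(uses: int, total_documents: int) -> float:
    """Return the percentage of documents that carry a given term."""
    return 100.0 * uses / total_documents


# Assumption: the total is inferred from 585 uses being 2.178 percent.
total_documents = round(585 / 0.02178)

print(total_documents)
print(round(usage_share(585, total_documents), 3))
print(round(usage_share(1, total_documents), 3))
```

Under that inferred total of roughly 26,860 documents, a single use rounds to 0.004 percent, which agrees with the figure quoted for "Fund Strategy - SWF".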
What is tragic and comic all at once is that, in the tagging for case information, the most frequent tag is "No Tag": 23% in the capability hierarchy and 24% in the industry hierarchy.
If such a high percentage of documents can go untagged, is it then necessary to tag cases at the opposite extreme? There is one case tagged "Enterprise ASP", for 0.003 percent. What makes it worse is that if a tag is used so infrequently, what is the likelihood that someone will (A) search for the term or (B) even know what the abbreviation stands for?
If these were internal-use-only taxonomies used to maintain directories and file manipulation, then such organization might matter. Unfortunately, these are customer-facing choices and complexities.
What is getting confused is taxonomical search vs keyword search. At the level where terms are arbitrarily added onto documents, you have moved from a purposeful taxonomy into scattered keywords. From a user's perspective it does not matter how the document is found, but different strategies should be used for tagging data when filling out taxonomies vs keywords. We expect rigidity in taxonomies and fluidity in keywords. Right now we have neither.
Wednesday, March 11, 2009
And then there were 3
As we move forward implementing a new search engine, it is becoming increasingly clear that the organization of the data to be indexed is quite a mess. Although the documents themselves are fine, how metadata is attached and ordered is scatterbrained, to say the least.
One of the areas which needs the most improvement is the classification of documents. Each document contains metadata for several hierarchies. To keep the discussion simple, I will work with one hierarchy -- Industry. The industry hierarchy classifies documents according to which industry they belong to and at which level. The hierarchy has a possible depth of 6. So, a document can be tagged in any of the following ways:
Industry: Consumer Products
Industry: Consumer Products\Food
Industry: Consumer Products\Food\Beverages
Industry: Consumer Products\Food\Beverages\SoftDrinks
It is not mandatory that each document goes to the 6th level because articles can be more or less specific regarding an industry. The previous post regarding searching has already pointed out the problems with our current taxonomy. This post will expand on that one by referring to real current problems.
When you look at the distribution of the number of terms at each level of the hierarchy, it breaks down like this:
| Hierarchy | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 |
| Industry | 14 | 96 | 275 | 380 | 412 | 421 |
| Capability | 11 | 62 | 164 | 215 | 215 | 215 |
| Cases | 15 | 99 | 292 | 412 | 456 | 464 |
These numbers are the count of terms at each level, and each level includes the levels above it. In each hierarchy you can see that level three (or possibly four) is the peak of term creation. After that, the number of documents actually tagged at levels 4, 5, and 6 is very low: documents tagged that deeply do not even reach a tenth of one percent of the total number of documents. It is highly unlikely that a user navigates down to that level of granularity. In addition, a UI for such granularity is a navigation nightmare. Finally, this granularity creates a user base that tries to be as specific as possible, thus reducing recall (the share of relevant documents that a search actually returns).
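Because each level's count includes the levels above it, the number of terms that first appear at a level is the difference between neighbouring columns. The following Python sketch derives those per-level counts from the table; the names `cumulative_terms` and `new_terms_per_level` are made up for illustration.

```python
# Cumulative term counts per level (levels 1 to 6), copied from the table.
cumulative_terms = {
    "Industry": [14, 96, 275, 380, 412, 421],
    "Capability": [11, 62, 164, 215, 215, 215],
    "Cases": [15, 99, 292, 412, 456, 464],
}


def new_terms_per_level(cumulative):
    """Turn cumulative counts into the number of terms added at each level."""
    previous = [0] + cumulative[:-1]
    return [current - prior for current, prior in zip(cumulative, previous)]


for hierarchy, counts in cumulative_terms.items():
    added = new_terms_per_level(counts)
    peak_level = added.index(max(added)) + 1
    print(f"{hierarchy}: added per level {added}, peak at level {peak_level}")
```

For all three hierarchies the largest number of new terms is added at level 3, and Capability adds no terms at all at levels 5 and 6.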
It is that last issue which our current system faces. In order to increase recall, expansive dictionaries have been created to expand the query. Of course, this has a negative effect on precision. And while recall is now increased, the drop in precision means more sifting for the user. Add to this poor relevancy ranking and the search results become next to meaningless.
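The trade-off between recall and precision can be made concrete with a toy calculation. The Python sketch below uses the standard definitions (precision is the share of returned documents that are relevant, recall is the share of relevant documents that are returned). The document IDs and the two result sets are invented purely for illustration and are not taken from our index.

```python
def precision(relevant: set, retrieved: set) -> float:
    """Share of the retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant: set, retrieved: set) -> float:
    """Share of the relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


# Hypothetical data: four documents are truly relevant to the user's need.
relevant = {1, 2, 3, 4}
narrow_results = {1, 2}                # very specific query
expanded_results = {1, 2, 3, 5, 6, 7}  # same query after dictionary expansion

for label, results in [("narrow", narrow_results), ("expanded", expanded_results)]:
    print(label, precision(relevant, results), recall(relevant, results))
```

In this made-up case, expansion lifts recall from 0.5 to 0.75 while precision falls from 1.0 to 0.5, which is the extra sifting the user ends up doing.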
Perhaps it is best to limit hierarchy tagging to 3 levels. After 3 levels the hierarchy taxonomy serves no purpose.
If a document needs more information not contained in the document itself or in the metadata already gathered, it would be possible to add another field to the document such as "Industry Tags". This could hold free-form modifiers and text supplied by authors and knowledge managers. Currently only the knowledge managers tag and classify the documents. This leaves a lot of room for interpretation in terms of specificity. And as we mentioned before, the taxonomy is out of control. A knowledge manager can add new terms if that person deems it necessary, and when dealing with very low levels of granularity, that situation is likely to arise.
Instead, an Industry Tag field would be used (mostly by authors and users) to help define the document. Users working in the same area with the same businesses are more likely to define documents in a similar manner. Authors are better able to summarize or keyword-ize their own works, at least better than a disinterested third-party knowledge manager.
Stopping at 3 or 4 levels creates a better search experience, since the navigation then contains fewer terms and more documents per term. The industry tags, now set by actual consumers of the text, can then narrow down the results within that more general hierarchy.
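To make the proposal concrete, here is a minimal Python sketch of the two steps: roll every hierarchy path up to at most three levels, then narrow within a navigation term using free-form industry tags. The sample documents, their tags, and the helper names (`truncate_path`, `documents_per_term`, `narrow`) are all hypothetical; only the path format follows the examples earlier in this post.

```python
from collections import Counter

MAX_DEPTH = 3  # assumption: the cut-off proposed in this post


def truncate_path(path: str, max_depth: int = MAX_DEPTH) -> str:
    """Keep only the first max_depth levels of a backslash-separated path."""
    return "\\".join(path.split("\\")[:max_depth])


# Hypothetical documents: a deep hierarchy path plus free-form industry tags.
documents = [
    {"id": 1, "industry": "Consumer Products\\Food\\Beverages\\SoftDrinks",
     "industry_tags": {"cola", "bottling"}},
    {"id": 2, "industry": "Consumer Products\\Food\\Beverages",
     "industry_tags": {"juice"}},
    {"id": 3, "industry": "Consumer Products\\Food\\Beverages\\SoftDrinks",
     "industry_tags": {"cola", "pricing"}},
    {"id": 4, "industry": "Consumer Products\\Food",
     "industry_tags": {"snacks"}},
]


def documents_per_term(docs):
    """Count documents per navigation term after truncating each path."""
    return Counter(truncate_path(doc["industry"]) for doc in docs)


def narrow(docs, term: str, tag: str):
    """Return IDs of documents whose truncated path equals term and that carry tag."""
    return [doc["id"] for doc in docs
            if truncate_path(doc["industry"]) == term
            and tag in doc["industry_tags"]]


print(documents_per_term(documents))
print(narrow(documents, "Consumer Products\\Food\\Beverages", "cola"))
```

In this invented data set the four-level SoftDrinks term disappears from navigation, the Beverages term now holds three documents, and the "cola" tag narrows those to documents 1 and 3.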
A Mystery
I don't know why, but this makes me feel all warm and fuzzy inside. I haven't posted in a while. I have been trying to get started with jQuery. While trying to come up with an Ajax solution, I remembered the great XDocument and VB's ability to embed XML literals in code.
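Public Function GetEmployee(ByVal employee_code As String) As XDocument
    ' The <?xml ...?> declaration makes this literal an XDocument;
    ' a bare <Root> element literal on its own is an XElement.
    Dim query As XDocument = <?xml version="1.0"?>
                             <Root>
                                 <%= From e In New EmployeeDataContext().employees Where e.employee_code = employee_code Select <name><%= e.last_name %></name> %>
                             </Root>
    Return query
End Function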
Wednesday, March 4, 2009
First Responders
So you are invited to the birthday party of someone you don't know. You show up late, and as you open the door the guest of honor is blowing out the candles. From about 6 feet away, all you see is the end of the blowing, but you notice the candles are not going out. Why?
This is a very simple example, but I use it to highlight something else.
When we encounter a problem in programming, what is our first response? What is our first gut reaction, and how does it compare with the first steps we actually take to resolve the issue?
Our reaction is heavily dependent on our own conditioning and experience, both with development in general and with the technology in particular.
This topic has been on my mind for quite a while for several reasons. The most prominent is this:
Who do you blame when something isn't working?
Essentially there are 3 main sources of potential "bugs" when trying to debug one's code:
1 - Your code
2 - The other person's code
3 - The technology
Unless you are dealing with beta, CTP or RC software, you should look in the order above when trying to fix a bug. This is especially true in my position, where most of our code is at a high level and a syntax mistake is far more likely than stumbling upon some obscure bug in some corner case.
Being a small shop that develops internal business applications, we are relatively free to use newer technologies in order to brighten an otherwise dark and mundane day. Not all developers are on board with such an approach. Feeling relatively safe in their position doing things they have always done, they are reluctant to move to newer technologies and platforms. For them something new is more work, not less. For them typing out 1000 lines of code is not an issue since there is no ultimate goal.
Me, I have YouTube to check, people and articles to write, code to explore. I embrace technologies that allow me more free time with a better guarantee of performance. For me, the learning curve is built into project development timelines.
These other corporate coders can't see the ultimate payoff since for them, as each day passes, any learning curve becomes too steep to climb. Learning new technologies is not much different than running. If you run a mile every day, that mile becomes easier and easier. If you never run, a quarter mile is daunting and the end might as well be 1000 miles away.
Learning new technologies not only helps you learn the technology, but gives insight on how to approach other newer technologies. You apply lessons learned from the learning itself.
The "meta" learning is what is at stake. The corporate coder who rarely ventures beyond his drag-and-drop world is quick to blame new technologies when they do not behave exactly as expected. They are quick to yell out "Well it doesn't work and there are a lot of bugs." These developers know their little world so well, they project that level of knowledge onto absolutely everything outside that world. If they can't figure something out, that thing must be broken or too complicated. It is important to prevent that stagnation.
What is the outcome of such a developer on a given project?
I recently had to work with such a developer and here were some of the outcomes:
The project used .NET table adapters for the data access layer. Without going into the pros and cons of the technology, 3 of the 4 developers agreed to use it. The fourth developer eventually signed on (in words) to use table adapters to access the data in the logic layer.
First Sign
The very first issue came in a meeting, when he said that what we were trying to accomplish was impossible: he had never seen it done and doubted we could do it. What was the task? Create an XML file from the table adapter, read the XML back into a table adapter, and insert into the database from the table adapter. This is far from even a difficult task, never mind an impossible one. Within 30 minutes I had a working prototype for him using our project. I had actually created production-level code in a matter of minutes. I am not a genius; that is what table adapters are used for. Serialized data can be easily transformed into strongly typed datatables, generic datatables or XML.
Second Sign
Everything was going fine. For most people, this is not a sign, but I knew that if I wasn't hearing anything, something was going wrong. The developer then told me there was no way to set the connection string at runtime. He had not done any research; he had simply not been able to find an obvious way of doing it at first glance. He had always set the connection string in the same way. Since it apparently could not be done in that same manner, it was impossible. Without even looking at the code I offered several possible ways to set the connection string. I was able to do this, NOT because I had mastered the technology, but because I had taken time to learn how Microsoft does things. There were just some basic techniques which were probably in use in the table adapters.
Third Sign
One of the strengths of strongly typed datatables is that they allow direct access to the data columns as row properties. This increases performance by eliminating column lookups by string value. However, when I looked into this developer's code he was still using row.item("columnName") as opposed to row.ColumnName. By refactoring out the old way and using the new notation, I increased performance 400%. He just didn't understand how there could be a difference between an in-memory object and a lookup. Since he never had to learn, he just didn't bother to learn.
The final sign
This developer had one section of code where he had to do his own data manipulation. And guess what, he reverted to what he knew. I didn't bother refactoring this code because I didn't have time; I just added an abstraction layer to make it work within the general framework. Sometimes you just don't have enough time.
To Be Continued....(this was getting too long)