Monday, August 13, 2007
After working with Enterprise and Site Search for nearly 8 years, I have learned that synonyms, and the ability to apply them to a search engine, are among the most valuable tools for modifying the search experience. Any search engine worth its salt will be able to pass one or many synonyms for any given search term and also allow administrators to manage the list.
Some customers ask me if they can import an entire thesaurus into a search engine. Although this is sometimes possible, I would recommend against it. This brute force approach often seems like a good idea, but I think you run the risk of wasting a lot of time on unnecessary term matches (more on this later) and of generating a lot of noise. Also, you are possibly adding thousands of lookups to the index, and on some search engines this can degrade performance at seek time.
Synonyms can be essential when you get involved in a search engine and start monitoring its statistics. If you begin to analyze search statistics such as top searched terms, top found terms, top not found terms, and top not clicked terms, you will find that about a third of all users on a large external site and more than half of all users on an internal site are looking for a very limited set of terms. For one very broad external site I was working on, this amounted to 32% of all users searching for just 25 words. I have also seen internal sites where over 80% of all users were looking for just 10 words. The other 68% on the external site and 20% on the internal site represented the famed 'long tail', where most if not all users looked for a single unique term each. Many will say that these are the people you would want to import that thesaurus for, but from what I've seen many of these searches are just mistakes. I think it's better to let things like phonetics (did you mean?) and other automated systems handle these 'misdirected' people.
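To make the statistic above concrete, here is a minimal Python sketch of how the share of searches covered by the top few terms could be computed from a query log. The function name and the sample log are hypothetical, and a real search engine will expose its statistics in its own format.

```python
from collections import Counter


def top_term_share(queries, top_n):
    """Return the fraction of all searches accounted for by the top_n terms."""
    # Normalize so that 'Jobs' and ' jobs' count as the same term.
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Hypothetical query log: two popular terms plus a tail of one-off queries.
sample_log = (
    ["jobs"] * 5
    + ["council tax"] * 3
    + ["recycling", "recyling", "bin days", "parking permit"]
)

print(f"Top 2 terms cover {top_term_share(sample_log, 2):.0%} of searches")
```

The terms that come out on top of such a report are the ones worth writing synonyms for by hand.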
In Microsoft Office SharePoint Server 2007, there are two areas where 'synonyms' can be applied but only one is what I would truly call a synonym:
The first is in the admin UI, under Site Settings > Search Keywords under the top level site settings. This Search Keywords section is really for creating best bets and that's what it should be used for. It's actually a great way to recommend links and direct users to content. Mondosoft's Ontolica actually enhances this quite a bit.

There is also the ability to add a synonym. The synonyms are the actual terms that are used in the query fields. The Search Keywords field will hold the term that actually gets displayed on the result page. Microsoft claims this function is for creating a definition list for particular terms. You can even add a link here. This seems just like a weaker version of best bets.

Here you can see the results of the keyword added above. Not terribly thrilling...

The real place to add synonyms is actually quite convoluted. Microsoft has a KnowledgeBase article here. The files are on the server itself, under Drive:\Program Files\Microsoft Office Servers\12.0\Data\Applications\Application UID\Config. On my Litware VPC (provided by Microsoft) the path is C:\Program Files\Microsoft Office Servers\12.0\Data\Applications\a1d0c399-7dfe-45e4-966e-8d5437ab611e\Config.
The Application UID is the ID of the web application and is likely some unfamiliar combination of letters and numbers. If you have only one web application this will be easy.
In this folder there are two types of files. The first are the noise word files. Noise words are the words ignored by the search engine, generally the most common words in the language that carry little meaning, such as 'and', 'this', 'that', and 'the'. If the search engine looked for these words when people searched for them, it would likely return every page in the index, which would not help anyone, so they are ignored.
The other files are the 'thesaurus' files for every language in the site. If multiple languages are not used, the search engine will default to tsneu.xml, the neutral file. The file looks like this out of the box:
<XML ID="Microsoft Search Thesaurus">
<!-- Commented out
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>Internet Explorer</sub>
<sub>IE</sub>
<sub>IE5</sub>
</expansion>
<replacement>
<pat>NT5</pat>
<pat>W2K</pat>
<sub>Windows 2000</sub>
</replacement>
<expansion>
<sub>run</sub>
<sub>jog</sub>
</expansion>
</thesaurus>
-->
</XML>
The first thing to do is remove the two lines that comment it out: '<!-- Commented out' and '-->'. Notice that there are two kinds of tag sections in the XML file: 'expansion' and 'replacement'. An 'expansion' section adds all of its 'sub' terms to any query that contains one of them. So if you search for IE, the query should be expanded with Internet Explorer and IE5. A 'replacement' section replaces whichever 'pat' terms are queried with their 'sub' equivalents. This is good for terms that would not return any results on their own.
So if I am selling windmills and we call them wind turbines in the industry, I still want people to find information when they search for windmill. So I might modify the file like this:
<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>wind</sub>
<sub>turbine</sub>
</expansion>
<replacement>
<pat>windmill</pat>
<pat>windmil</pat>
<sub>wind turbine</sub>
</replacement>
<expansion>
<sub>turbin</sub>
<sub>turbine</sub>
</expansion>
</thesaurus>
</XML>
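To illustrate how I read the expansion and replacement rules, here is a small Python sketch that parses a thesaurus file of this shape and rewrites a single query term. It is a simplified model for illustration only, not how SharePoint itself processes the file; the function names and the trimmed-down sample file are my own assumptions.

```python
import xml.etree.ElementTree as ET

# The thesaurus element declares a default namespace, so ElementTree
# reports every tag inside it with this prefix.
NS = "{x-schema:tsSchema.xml}"

# Trimmed-down sample in the same shape as the file above.
THESAURUS = """<XML ID="Microsoft Search Thesaurus">
<thesaurus xmlns="x-schema:tsSchema.xml">
<diacritics_sensitive>0</diacritics_sensitive>
<expansion>
<sub>turbin</sub>
<sub>turbine</sub>
</expansion>
<replacement>
<pat>windmill</pat>
<pat>windmil</pat>
<sub>wind turbine</sub>
</replacement>
</thesaurus>
</XML>"""


def load_rules(xml_text):
    """Return (expansion groups, replacement map) read from the XML text."""
    root = ET.fromstring(xml_text)
    expansions = []    # each entry is a list of interchangeable terms
    replacements = {}  # maps each 'pat' term to its list of 'sub' terms
    for thesaurus in root.iter(NS + "thesaurus"):
        for exp in thesaurus.findall(NS + "expansion"):
            group = [(s.text or "").strip().lower() for s in exp.findall(NS + "sub")]
            expansions.append(group)
        for rep in thesaurus.findall(NS + "replacement"):
            subs = [(s.text or "").strip().lower() for s in rep.findall(NS + "sub")]
            for pat in rep.findall(NS + "pat"):
                replacements[(pat.text or "").strip().lower()] = subs
    return expansions, replacements


def rewrite_term(term, expansions, replacements):
    """Return the list of terms that would actually be searched for."""
    key = term.strip().lower()
    if key in replacements:       # replacement: the original term is dropped
        return list(replacements[key])
    for group in expansions:      # expansion: the original term is kept
        if key in group:
            return list(group)
    return [key]                  # no rule applies


expansions, replacements = load_rules(THESAURUS)
print(rewrite_term("windmil", expansions, replacements))
print(rewrite_term("turbin", expansions, replacements))
```

The difference to notice is that a replaced term disappears from the query, while an expanded term stays in it alongside its synonyms.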
I've heard there is a maximum file size for this file of 10 MB. Don't quote me on that, but I do know there is a maximum, so it is not possible or practical to import the entire Oxford thesaurus into the file.
Save the file and recrawl and you should be good to go. I'd love to hear if anyone has had any good or bad experiences using these files in either SPS 2003 or MOSS 2007.
Thursday, May 03, 2007
I went to see Steve Ballmer, among others, speak on Microsoft Office SharePoint Server 2007 Search at a special event for Small Business Specialists here in Denmark. Most of the speeches were quite weak on the search topic, including Steve's, and most of the audience was there more to see Steve than to care about Search. It seemed strange to me to invite Small Business Specialists to see Steve speak on Enterprise Search, but there were about 500 people attending.
At the question period after Steve's speech (I honestly can't remember what he said), people began to ask all sorts of non-search related questions. The favourite topics were about mobile devices and Vista. At one point Steve asked how many people had NOT installed Vista on their own machines yet and over half of the people put up their hands. In response to this, Steve merely said 'wow'. After a moment of silence one of the guests blurted out 'There's the Wow Effect'. Somehow, I found that poetic.
Once again, Microsoft is making a good effort to market their initiatives with little understanding about how to do it. I'm sure Steve's time could have been better spent, and it would have been even better if he had actually said something about Search. Even their expert (Edward O'hara from Jupiter Research) only talked about the Global Search Market trends. Mind you, his Danish was quite impressive.
Friday, March 30, 2007
One of the most important and overlooked items in a search engine's usability is the titles and descriptions on the result pages. Because any monkey can code HTML and any baboon can make a program to do it for them, HTML code sucks. Every page has different code and many are riddled with mistakes. For an example, go to the World Wide Web Consortium's HTML validator and run a test on www.google.com - 106 errors last time I checked.
So it's inevitable that bad titles and descriptions, or none at all, are common. The advent of Content Management Systems should have solved this but they just complicated it by using templates with THE SAME METADATA ON EVERY PAGE.
Luckily, some are now getting the point and allowing authors and editors to add or manipulate their own titles and descriptions. Great! But I get the question: what are 'best practices' for titles and descriptions for my pages?
Well, here are some guidelines.
1) If your enterprise search engine can do it (of course MondoSearch can), make Meta tag titles and descriptions specifically for it. Many companies want to put their company name in every title so global search users can identify the pages. This seems OK to me, but local search users don't want to see that. Make a <meta name="local-title" content="title"> tag and tell your enterprise search engine to pick it up and use it. The same goes for descriptions.
2) Titles should be short; 1-3 words is best. People don't like to read. They just want to catch the point of the document and move on if it's not the one they are looking for. Every document usually has one overall theme - find a way to describe it. If the document is about catching rats, title it 'Rat catching'.
3) A description should be a little longer than a title but not much (5-8 words). You want to describe what the document is about so the users can differentiate it from other similarly titled ones, but you don't need to explain the entire document. If your rat catching document is about calling your local pest control authority to catch your rats, describe it as such.
4) Avoid marketing 'Mumbo-Jumbo'. Long descriptions about how wonderful your products are will not impress anyone and will just confuse most - avoid them totally.
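As a rough illustration of guideline 1, the Python sketch below shows how a crawler could prefer a 'local-title' meta tag over the regular title tag when building a result entry. The class and function names are hypothetical and this is not MondoSearch's implementation; it only uses the standard library's html.parser.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the <title> text and any <meta name="local-title"> content."""

    def __init__(self):
        super().__init__()
        self.local_title = None
        self.html_title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attributes.get("name") or "").lower() == "local-title":
            self.local_title = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.html_title += data


def result_title(html):
    """Prefer the local-title meta tag and fall back to the <title> tag."""
    parser = TitleExtractor()
    parser.feed(html)
    return (parser.local_title or parser.html_title).strip()


page = (
    "<html><head><title>Acme Corp - Rat catching</title>"
    '<meta name="local-title" content="Rat catching"></head><body></body></html>'
)
print(result_title(page))
```

Global search engines still see the full title with the company name, while the local result page shows the short one.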
Here is an example of a site that has some pretty good titles and descriptions and they help you, usually, find the right information.
London Borough of Lewisham
Check it out! Try searching for 'Council Tax', 'Jobs', or 'Recycling' - some of the most popular queries.
Friday, March 23, 2007
One of the points on a slide I present in training and many presentations on typical problems with Enterprise Search engines is 'Bad Document Authoring'. I just found another example where this is a problem, so I thought I would write a few words about it here.
Many corporations produce documents in a number of different formats. Authors use a variety of word processing or desktop publishing tools to produce documents. The most common, of course, are Microsoft Office applications like Word or Excel and Portable Document Format (PDF). Many organizations also have decided that PDF is a better format for publishing official documents because they cannot be modified and formatting is preserved.
The major problem with indexing and returning these documents as good results is that authors are really bad at adding the kind of information (metadata) that describes these documents. Often, even filenames have little to no meaning. Many believe that this is the reason why you need a search engine - to find these poorly authored and organized documents.
For us, the most common problem is returning a PDF document that has a title somewhat like this: 'Microsoft Word - Document 2384' or simply '010504ext.doc'. MondoSearch, like most search engines, will look at the title tag content of the documents it finds and try to use that as a title. If this doesn't exist, it will look elsewhere or try to generate the title. Many times the fallback never gets a chance, because the title tag exists but contains only the filename of the document.
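A simple heuristic for spotting these useless titles can be sketched in Python as follows. The patterns and function names are illustrative assumptions based on the two examples above, not the rules MondoSearch actually applies.

```python
import re

# Illustrative patterns only: an auto-generated Word title and a bare filename.
AUTO_TITLE = re.compile(r"^microsoft word - ", re.IGNORECASE)
FILENAME = re.compile(r"^\S+\.(docx?|xlsx?|pptx?|pdf)$", re.IGNORECASE)


def is_unhelpful_title(title):
    """True if the title is empty, auto-generated, or just a filename."""
    title = (title or "").strip()
    return not title or bool(AUTO_TITLE.match(title)) or bool(FILENAME.match(title))


def choose_title(title_tag, first_heading, filename):
    """Use the first candidate that looks like a real title, else the filename."""
    for candidate in (title_tag, first_heading):
        if not is_unhelpful_title(candidate):
            return candidate.strip()
    return filename


print(choose_title("010504ext.doc", "Annual Report 2006", "010504ext.pdf"))
```

A check like this only tells you which documents need attention; the fix is still one of the options below.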
So how can you get around this? Well, there are several options:
1) Get your authors to enter metadata in their Word documents. - This is probably the best method and easy to do but suffers from poor user adoption. The authors must open the properties dialogue when creating the document and type in a title and description about the document. The title should ideally be 2-5 words and the description 3-8 words long. I will likely make another post about titles on this blog, so I won't go into more detail on this now.
2) Add titles and descriptions to the PDF documents. - Most PDF documents are generated by pushing a little PDF icon'd button in the corner of Word. This generates the document automatically and does not offer to add the information. Therefore, adding them to the PDFs manually is the only option. You must open the PDF in Acrobat, click on the little arrow above the scrollbar on the right, and choose document properties. Here you (or your part time monkey) can enter a title and description for the documents.
3) Use MondoSearch's pre-indexing module, Content Optimizer, to add the information to the document at crawl time. - Our Content Optimizer is a pretty powerful tool that will allow you to programmatically add metadata to documents at crawl time. If all your documents have similar patterns, you can use the rules in the Optimizer to generate titles and descriptions from these patterns. I've used this tool to add a lot of metadata, ignore irrelevant content, and even boost ranking on all sorts of document types.
Although I love our Content Optimizer, the best way to solve this problem is at the source: educate authors to make documents with good metadata. Even having all the existing PDFs fixed is probably better than building all sorts of rules to compensate for bad authoring. However, if options 1 and 2 are not available to you, try out Content Optimizer. Some consulting may be needed, but I'd be happy to help you out.
Monday, March 19, 2007
FASTforwardblog has an interesting interview with Susan Feldman, well known search researcher at IDC. She talks about research she did on advertising and referral from global search and her findings. She says that they took publicly available data on global search queries and data from sites using global search advertising and found a large 'hole' when she compared them. The 'hole' is in the number of people they expected to come from global search compared to the actual number of visitors. She estimates that over 70% of all users go directly to most sites and begin searching for the information they want instead of getting referred from global search engines. This is contrary to the belief that all web traffic is originating at Google or Yahoo. This fits my personal habits and the research that we have done. We have seen that most global referrals are very unspecific and it seems that the users are just using the global search engines as a guide to find sites. Once they have found it, they need not use the global search to find that site again but will rely on the site search.
What this means for site owners is an even bigger need to have an effective local search strategy. 70% of users are arriving at your 'front door' looking for the information instead of jumping directly to it via global search as many would have us believe.
Susan speaks of local advertising campaigns where large sites like Amazon can use their own search technology to direct content to users. This concept is exactly what MondoSearch's SearchHeaders feature is built for. When people search for things on your site, you need not only give them the content they have queried but can also direct some advertising their way. Our customer Coleman.com uses it to direct people to Coleman Powermate when they search for 'generators', as Coleman.com doesn't stock or sell generators.
This feature is very versatile and helps users to get to the content they are looking for, or even other interesting content, based on the site owner's wishes, not the search engine's algorithms.
To hear more from Susan, I suggest you attend the Enterprise Search Summit held in New York in May where she will be a speaker.
To learn more about SearchHeaders, check our site.
Wednesday, March 14, 2007
Lynda Moulton, over at the Gilbane Group, posted an interesting blog post on how Enterprise Search vendors don't really get the complexities of the customer engagements for which they are offering solutions. She seems frustrated by the difficulties in getting these 'wonderful' enterprise search products working, and is missing good documentation, service, and training. She identifies the poor service as a weakness in the vendors and I agree with her wholeheartedly. However, I think it is important to outline the reasons why we vendors are like that.
1) Search vendors sell a product - It's a shrink wrapped, 'deliver with an installation wizard and go' solution which is just supposed to work in every customer's environment no matter how complex.
2) Those who purchase enterprise search tools expect to be able to have their IT admin set up and maintain this tool.
3) Google sets the bar. It's built for the IT admin and gives 'good enough' results so customers expect to get that from a way more complicated tool and for the same price. And vendors know that's what they are competing against.
4) Customers drive the development, packaging and service level - you can't force customers to care about good documentation, good service, and good training. If you could, everyone would use MondoSearch. Customers compare products based on feature checklists.
5) Many vendors and customers depend on System Integrator partners to do the services part so can only hope that they have the skills to do the job and the desire to give good service.
6) Analysts promote products that can do very technical things (like index hundreds of document types) and those who can pay them a lot to do analysis.
She recommends "Frank discussions with customers that set expectations about deployment and implementation, potential bottlenecks, and the need for experienced searchers, search analysts and subject matter experts on the team with the IT group". This is all fine and dandy but from my experience, when we start to discuss these things with customers they get concerned. We have been selling search analysis tools for about 6 years now and when customers hear that they might have to spend 20 minutes a week monitoring the search engine for performance they get very worried looks on their faces. Software is supposed to make your job easier, not harder! This is why, I believe, Endeca is making headway in the market - guided navigation is easy, automatic, and promises to solve all your search woes.
So Lynda should not feel too bad. I know it's frustrating to deal with vendors, but not all vendors are the same and she certainly hasn't tried us all. And partially, analysts are to blame for the way vendors are seen and motivated in the marketplace.
Monday, March 12, 2007
Here are a couple of year old podcasts from Gartner. It's interesting to listen and think about how accurate they are a year on...
The Evolution of Information Access (Jan 3, 2006)
The Importance of Information Access (Jan 17, 2006)
The most interesting thing about analysts is that they claim they can predict the future. In a way this is a bit of a self-fulfilling prophecy because vendors and customers will change in reaction to their predictions, but they often have very good insight into the markets. Sometimes, things go the opposite direction though.
Have a listen to these and see if you think they are right.
Friday, March 09, 2007
One interesting thing about web content nowadays is that it is served almost exclusively by CMS systems. Most companies and even people with personal sites (like my own) use some sort of CMS system like Sitecore, DotNetNuke, Microsoft CMS, or EpiServer (my personal favourite) to manage their information and post it to the web.
The problem for most search engines, both local and global, is that the pages in their systems are based on templates that have the same menus and information on every page. There is usually just a small section in the middle of the page that actually has the content that the page is about. There are often even news items or advertisements in the sidebars of the templates. A lot of this recurring content also fits very well with the most important concepts of the organization. Therefore, many searches return all the pages when searching for a general concept expected on the site. This produces a lot of noise in the search results. Many of my customers ask how they can avoid this.

Some other vendors (e.g. Microsoft SharePoint) have offered the suggestion of returning a different version of the site to the search engine when it crawls, by recognizing the Agent Identifier of the search engine and then returning only the content parts of the page. This causes a lot of hassle and requires some sort of programmatic intervention, sort of like a browser check.
Many years ago, before I started with Mondosoft, we had already solved the problem by inventing a special tag pair that can easily be placed around the sections of the template (or around user controls) that you don't want indexed. This tag pair was originally <noindex></noindex> but has since been changed to <!-- noindex --> <!-- /noindex -->. The change puts the HTML tag pair in comments so that other crawlers/browsers do not get confused by it and the pages are HTML standards compliant.
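To show the idea, here is a minimal Python sketch of what a crawler could do with the comment-based tag pair before indexing a page. The regular expression and function name are my own assumptions for illustration; the real crawler may treat whitespace or nesting differently.

```python
import re

# Matches everything from an opening noindex comment to the nearest closing one.
NOINDEX = re.compile(
    r"<!--\s*noindex\s*-->.*?<!--\s*/noindex\s*-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_noindex(html):
    """Remove the template sections marked as not to be indexed."""
    return NOINDEX.sub("", html)


page = (
    "<div><!-- noindex --><ul><li>Home</li><li>Products</li></ul><!-- /noindex -->"
    "<p>Wind turbines</p></div>"
)
print(strip_noindex(page))
```

Only the text that survives this step is indexed, so a menu entry like 'Products' no longer makes every page match a search for products.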
I know that the World Wide Web Consortium did look into this issue but didn't come up with a way to exclude specific content from crawlers. The best suggestion I could see them coming up with was having noindex as a class value on tags. This, however, would screw up your design and formatting if you were using cascading style sheets (CSS).
I recommend that all our customers use this tag pair if they can - you will see an immediate improvement in your search results!
Wednesday, February 28, 2007
I met Mike Pallot at Microsoft in Thames Valley Park in the UK yesterday. He's a really sharp guy and didn't mess about. I was late for the meeting due to a delayed flight but we still managed to conclude the meeting early with focused action items laid out. Mike pointed me to his blog for some simple advice on partnering with Microsoft and after reading it I remembered his straight, accurate, concise approach.
It's a great article for anyone interested in working better with Microsoft. Highly recommended.
Friday, February 23, 2007
I got a very good request to define the difference between site search (which may be extended to enterprise search) and global search. It's a very good question because many believe that the two are the same, that the technologies and behaviors are the same and that, therefore, the same tools can be used to do the job.
For many sites, submitting a request to Google with the 'site:' operator is sufficient search. Google has indexed a few pages from their site and just having search is a novelty. For larger sites, businesses, or just serious organizations, this is certainly not enough. Google will return what it finds and rank it by popularity, leaving out anything that it can't find and ignoring the needs and wishes of the site's owners.
I would say most, if not all, global search engines use some sort of popularity ranking mechanism. The most famous, developed by the Google founders, is called PageRank, and it was developed while they were students at Stanford University before they started Google. The genius thing about PageRank is that it takes into consideration (and relies on) the fact that the internet is a community - a social network wherein members of the community recommend (link) content with value and ignore content without value. When the web began, people had 'link' pages where they put links to sites they liked. Blogs are a natural and perfect evolution of that, and now people are writing about and linking to other content at an alarming rate. SEO people and most bloggers know this and try to manipulate it by spamming comment fields on other blogs and linkbaiting (writing controversial or provocative entries with the hope of getting links to them). Enough about Global Search.
Local, site, or Enterprise search does not have the 'luxury' of the social community. Also the community element within an organization does not necessarily reflect the business needs of the organization. I blog a lot on our internal SharePoint blog but mine is not the only blog. However, mine is the most frequently updated. This would mean that anything I find interesting has many more links to it than other content. So in our organization, my favorite things would get higher ranking if our search engine used PageRank.
Local search is much the same and should be tightly controlled by the site authors and administrators. The number of links does not often reflect the value of the content on the site or, more importantly, what the users are interested in. A perfect example is a customer who built a website to track shipments for their customers. They created a lot of information on the site about shipping routes, types, etc. and put links on every page to their tracking application, only to find that over 30% of all searches were for one term - 'job'. They didn't even have a single job posted on the site, as these things were handled by their parent company. Needless to say, popularity ranking would not have helped.
Instead local search engines rely on very simple ranking methods:
1) Frequency - how many times do the target term(s) appear in the document?
2) Density - what is the frequency relative to the size of the document? A long page will naturally have more occurrences of a specific word and would otherwise be ranked higher on frequency alone.
3) Position - where on the page was it found?
4) Appearance in meta tags - keywords or description tags
5) And/or, proximity, phrase search - Are all of the words on the page, and do they appear close together or in a phrase?
6) Appearance in title tags - If the term is in the title, it is likely the document is about that term.
7) Forced Ranking - tagging or setting a particular page to be ranked higher.
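As a rough sketch of how a few of these signals might be combined, the Python below scores a document for a single search term using frequency, density, position, title match, and a forced boost. The weights and function names are arbitrary assumptions chosen for illustration; no vendor's actual formula is implied, and meta tags and multi-word proximity are left out to keep it short.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(term, title, body, forced_boost=0.0):
    """Combine simple ranking signals for one single-word term."""
    term = term.lower()
    words = tokenize(body)
    frequency = words.count(term)
    # Density corrects for long pages that mention the term often by sheer size.
    density = frequency / len(words) if words else 0.0
    # Position: an early first occurrence scores close to 1, a late one close to 0.
    position = 1.0 - words.index(term) / len(words) if frequency else 0.0
    in_title = 1.0 if term in tokenize(title) else 0.0
    # The weights below are arbitrary and would need tuning on real data.
    return frequency + 10.0 * density + position + 5.0 * in_title + forced_boost


doc_a = score("rat", "Rat catching", "How to catch a rat in your home")
doc_b = score("rat", "Pest control", "Call us about any rat or mouse problem in the borough")
print(doc_a > doc_b)
```

The forced_boost argument stands in for forced ranking: an administrator can push a page up regardless of what the text-based signals say.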
Aside from this, there are a number of other ranking mechanisms employed by site search providers. Many of the vendors attempt to keep these things a secret but the basics are the same. One ingenious method revealed recently is the inclusion of terms on the link text from the referring page which Microsoft uses in their new Microsoft Office SharePoint Server Search.