Professional Microsoft Search: FAST Search, SharePoint Search, and Search Server
By Mark Bennett, Jeff Fried, Miles Kehoe, and Natalya Voskresenskaya
John Wiley & Sons
Copyright © 2010 John Wiley & Sons, Ltd
All rights reserved.
Chapter One: What Is Enterprise Search?
WHAT'S IN THIS CHAPTER?
* Defining Enterprise Search, and how it differs from Internet search portals
* Giving an overview of Enterprise Search architecture and the Microsoft Search lineup
* Characterizing the use of search within an organization
* Exploring Search ROI and SCOE
* Answering common questions about Enterprise Search
Many people assume that "Enterprise Search" refers to search behind a corporate firewall. Although it certainly includes that, in this book we'll use a broader definition and consider Enterprise Search to be the search technology that your organization owns and controls, as opposed to the giant Internet search portals like Yahoo!, Google, or MSN/Bing.
This broad definition allows us to include and cover other search systems that power customer-facing applications and web properties that the company itself owns and controls. Such applications could include the search on a company's website home page and Tech Support area, or eCommerce shopping sites, which are also heavy users of search.
Organizations have different business objectives, and they implement search to help achieve those goals. As you'll see, Microsoft offers a wide range of products to power internal and customer-facing applications. But if you add up all the things that different organizations use search for, you come up with a pretty long list! Over the years, we've seen an amazing variety of ideas and projects, and about the only thing they have in common is being controlled by a specific company or agency, as opposed to being under the control of the giant web portals. This control issue is key; we'll come back to it again and again. If you are not happy with how Yahoo! or Google indexes your public site, there's a limited number of things you can do about it.
But if you own it (or lease it), and it's not working, you can change it. You can adjust it, tweak it, audit it, enhance it, or rip the whole darn thing out and start over! Ownership equals control!
Broadly, Enterprise Search could be thought of as all search engines except the public Yahoo!, Google, and MSN ones, since you do own and control the search engine that powers your public website or online store. And again, your usage patterns and priorities are likely different from those of the Internet portals.
This chapter introduces the concept of Enterprise Search, discussing its origins and how it differs from Internet search. It then provides a brief history of searching and discusses Microsoft's continuing commitment to improving its Search technologies, including a discussion of why Microsoft acquired the FAST ESP technologies and the current road map for integrating these technologies. You'll learn why a company would want to invest in Microsoft Enterprise Search and what some of the key components are.
WHY ENTERPRISE SEARCH?
Enterprise Search applications deliver content for the benefit of employees, customers, partners, or affiliates of a single company or organization. You're reading this book, so we imagine you already "get it." But if you didn't, we could say something like "if your company can afford to have search that's broken, either driving potential customers away or wasting countless hours of employees' time, then perhaps you don't need Enterprise Search!"
Companies, government agencies, and other organizations maintain huge amounts of information in electronic form, including spreadsheets, policy manuals, and web pages, to mention just a few. Contemporary private data sets can now exceed the size of the entire Internet in the 1990s, although some organizations do not publicize their stores. The content may be stored in file shares, websites, content management systems (CMSs), or databases, but without the ability to find this corporate knowledge, managing even a small company would be difficult.
HOW ENTERPRISE SEARCH DIFFERS FROM WEB SEARCH
Search on the Internet is good; everyone knows that. But Enterprise Search often draws complaints for not performing up to expectations, and there are some fundamental reasons why.
The Enterprise Is Not Just a Small Internet
Many Enterprise Search offerings began life as search engines to power generic Internet portal searching. You'd assume that, if you could handle the Internet, then of course you could handle a relatively puny private network; it just makes sense!
This seems like a perfectly sane and compelling argument, and this model has worked at some companies. If your intranet has a few dozen (to a few thousand) company portals and departmental websites, which mostly contain HTML and PDF documents, this could possibly work for you.
But this assumption is usually false, and such engines have had to be adjusted to work well in the distinctly non-Internet-like corporate and government networks. To be fair, most vendors have responded to these differences with enhancements to their enterprise offerings. However, the underlying architecture and design may prove to be a fundamental mismatch for some specific search applications.
Technical Differences in Search Requirements and Technologies
Aside from data volume, there are a number of other technical differences between a company's private intranet and the Internet. There are also differences in how the infrastructure is used and in the functional requirements. These differences are the seeds for different software implementations. Here are some of the significant differences:
* There is usually a right document — Whereas Google finds tens of thousands of pages relevant to almost any search you could imagine, corporate searchers prefer fewer highly relevant results for a given search, and often there is only one "right" document: a project status report, a client profile, or a specific policy. If Google misses a few thousand documents, few people notice; if your corporate search misses one, users may consider it a failure.
* Security is critical — On the Internet, content is public for anyone and everyone who may find it. Companies often have many specific security requirements, from "Company Confidential" to "Limited Distribution." There may even be legal implications if a document is released to the public before a specific time and date.
* Taxonomies and vocabularies are important — Companies often have a specific vocabulary, such as project and product names, procedures, and policies. Corporations often have invested significant resources to build and maintain a taxonomy to categorize and retrieve content, often from content management systems. Taking advantage of these terms unique to an organization is critical to making retrieval work better.
* Dates are important — Internet search is generally unaware of document dates, because content on the Internet often lacks this information. If a corporate search for "annual report" doesn't return the most recent document, your users will be unhappy.
* Corporate data has structure — In corporate databases, and even in web content, companies have fields specific to the structure of corporate data. A large consulting firm may include human-authored abstracts in each report, and corporate search technology has to be able to boost documents based on relevant terms in the abstract.
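Several of these differences can be made concrete with a small sketch. The Python below is purely illustrative and is not how any Microsoft or FAST product ranks results: the `Doc` fields, the abstract boost of 3.0, the one-year half-life, and the group names are all assumptions we made up for the example. It shows security trimming (hidden documents never reach the ranking step), a boost for terms found in a structured field (the abstract), and a preference for recent documents.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    title: str
    abstract: str              # structured field, e.g. a human-authored abstract
    body: str
    modified: date
    allowed_groups: frozenset  # hypothetical ACL: groups allowed to read the doc

ABSTRACT_BOOST = 3.0    # assumed weight for a hit in the abstract
HALF_LIFE_DAYS = 365.0  # assumed: a document's score halves for each year of age

def score(doc, term, today):
    """Term count with an abstract boost, scaled down as the document ages."""
    term = term.lower()
    body_hits = doc.body.lower().split().count(term)
    abstract_hits = doc.abstract.lower().split().count(term)
    age_days = max((today - doc.modified).days, 0)
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return (body_hits + ABSTRACT_BOOST * abstract_hits) * recency

def search(docs, term, user_groups, today):
    """Return titles of matching documents the user may see, best first."""
    # Security trimming: drop documents the user has no group in common with.
    visible = [d for d in docs if d.allowed_groups & user_groups]
    scored = [(score(d, term, today), d) for d in visible]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [d.title for _, d in matches]

# Made-up sample data for the example.
DOCS = [
    Doc("Annual Report 2008", "annual report for fiscal 2008", "revenue grew",
        date(2009, 3, 1), frozenset({"staff"})),
    Doc("Annual Report 2009", "annual report for fiscal 2009", "revenue grew again",
        date(2010, 3, 1), frozenset({"staff"})),
    Doc("Board Minutes", "confidential annual review", "annual planning",
        date(2010, 4, 1), frozenset({"board"})),
]
```

Calling `search(DOCS, "annual", frozenset({"staff"}), date(2010, 6, 1))` ranks the 2009 report above the 2008 one and never exposes the board minutes to a staff-only user. A real engine would use a far richer relevancy model; the point is only that dates, fields, and security all feed into the result list.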
The public Internet was the inspiration and proving ground for a majority of the commercial and open source search engines out there. Creating a system to index the Internet has influenced both the architecture and implementation as engineers have made hundreds of assumptions about data and usage patterns — assumptions that do not always apply behind the firewalls of corporations and agencies. There are dozens of things that make Enterprise Search surprisingly difficult and that sometimes flummox the engines that were created to power the public web.
When vendors talk about their products, features, and patents, they are usually talking about technology that was not specifically designed for the enterprise. This isn't just academic theory; as you'll see, these assumptions can actually break Enterprise Search, if not adjusted properly.
Every Intranet Search Project Is Unique
Although some engines were not created for the Internet, they were still usually targeted at specific business applications. For example, imagine an engine that was created to serve a complex parts database. Perhaps a spider and HTML filters were later added, but that was not its genesis. That engine could have certain intrinsic behaviors and limitations that don't align with other search projects, such as a heavy-duty versioned CMS search application. This doesn't mean that the engine is "bad"; it's just a question of mismatch.
ENTERPRISE SEARCH TECHNOLOGY OVERVIEW
At the most basic level, search engines share these four logical components:
* Spider and/or indexer process (AKA data prep)
* Binary full-text index (AKA the index)
* The engine that runs the searches and gives back results (AKA the engine)
* Administration and reporting
Each one of these systems is dependent on the previous one to function properly, except for administration, which controls the other three. A search engine can't run searches if there is no full-text index, and there won't be any full-text index if the documents are never fetched and indexed.
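To make this dependency chain concrete, here is a toy version of the first three components in Python. It is a sketch under heavy simplifying assumptions (whitespace tokenization, an in-memory dictionary standing in for the binary index, AND-only queries), and the function names are invented for illustration rather than taken from any product.

```python
from collections import defaultdict

def prepare(raw_docs):
    """Data prep: turn raw text into (doc_id, tokens) pairs."""
    for doc_id, text in raw_docs.items():
        yield doc_id, text.lower().split()

def build_index(prepared):
    """The index: map each word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in prepared:
        for token in tokens:
            index[token].add(doc_id)
    return index

def run_search(index, query):
    """The engine: return ids of documents that contain every query word."""
    words = query.lower().split()
    if not words:
        return []
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))

# Made-up sample documents.
RAW_DOCS = {
    "policy.doc": "Travel expense policy",
    "status.doc": "Project status report",
    "travel.htm": "Travel booking status",
}
INDEX = build_index(prepare(RAW_DOCS))
```

With this data, `run_search(INDEX, "travel status")` finds only `travel.htm`, while `run_search({}, "travel")` returns nothing at all: with no index, the engine has nothing to search.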
Search Components Outline
Modern search engines have further subdivided the data prep, index, and search functions into additional subsystems to achieve better modularity and extreme scalability.
An exploded component view might look like this:
* Data Prep
  * Spider
    * Cross-Page Links Database
    * Document Cache
    * Fetch Web Pages
    * Extract Links to Other Pages
    * Scheduling Fetches and Refetching
  * Processing
    * Determine Mime Type
    * Filter Document
    * Parse Meta Data
    * Entity Extraction
  * Indexing
    * Determine Document Language
    * Separate into Paragraphs, Sentences, and Words
    * Calculate Stemming, Thesaurus, etc.
    * Write to Full-Text Index
* Full-Text Index
  * Word Inversion Index
  * Special Indexes (i.e., Soundex, Casedex, etc.)
  * Metadata Index
  * Word Vector Data, N-Gram
  * User Ratings and Tags
  * Periodically Validate and Optimize Full-Text Indexes
  * Replicate Full-Text Indexes
* Search Engine
  * Accept initial Query from the User
  * Preprocess Query (thesaurus, relevancy, recall, etc.)
  * Distributed Query
  * Check Actual Full-Text Index
  * Merge Intermediate Query Results
  * Calculate Relevancy
  * Sorting and Grouping
  * Calculate and Render Navigators
  * Render Results to User
  * Gather User Feedback and Tags
* Administration and Reporting
  * Managing the search platform
Even this outline is oversimplified for larger, more complex engines.
Vendors use different names and buzzwords for these search functions. For example, the act of fetching a web page, looking at the links on it, and then downloading those pages is called "spidering" by some and "crawling" by others. ESP (FAST/Microsoft's Enterprise Search Platform) further subdivides this into the fetching of the web pages and the indexing of the pages once they have been downloaded, so in ESP the enterprise crawler, document pipeline, and indexer are distinct subsystems and often reside on different machines.
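The fetch-and-follow-links loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the ESP crawler: the `crawl` function, the `fetch` callback, and the in-memory `FAKE_SITE` are all hypothetical, and a real crawler would add scheduling, politeness rules, URL normalization, and refetching. Note that the sketch only fetches; it hands the pages back so that a separate stage can index them, mirroring the split between crawler and indexer.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Breadth-first crawl. `fetch(url)` returns HTML, or None if unavailable."""
    seen = {start_url}
    queue = deque([start_url])
    fetched = {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # dead link: nothing to store or parse
        fetched[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return fetched

# A made-up three-page site held in memory, so the example needs no network.
FAKE_SITE = {
    "/home": '<a href="/about">About</a> <a href="/products">Products</a>',
    "/about": '<a href="/home">Home</a>',
    "/products": '<a href="/missing">Old link</a>',
}
PAGES = crawl("/home", FAKE_SITE.get)
```

Here `PAGES` ends up holding the three reachable pages; the link back to `/home` is not fetched twice, and the dead `/missing` link is skipped.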
In the early days of search software, if you needed to handle more data, you upgraded the machine's memory or hard drives, or upgraded to a faster machine. Most modern engines scale by adding more machines and then dividing the work among them. This division of labor is usually done by distributing these subsystems across these multiple machines, so this is an additional motivation for you to understand the various subsystems in your engine.
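One common way to divide the work is to split the index itself into partitions, often called shards, and have every query consult all of them. The Python sketch below assumes that approach; the names are hypothetical, each "machine" is just a dictionary in the same process, and assigning documents by a CRC32 hash of the document id is only one of several possible schemes.

```python
import zlib
from collections import defaultdict

def shard_for(doc_id, num_shards):
    """Pick a shard for a document. CRC32 is stable from run to run."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def build_shards(docs, num_shards):
    """Build one small inverted index per shard (one per 'machine')."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id, num_shards)]
        for word in text.lower().split():
            index[word].add(doc_id)
    return shards

def search_all(shards, word):
    """Ask every shard for its hits, then merge the partial results."""
    hits = set()
    for index in shards:
        hits |= index.get(word.lower(), set())
    return sorted(hits)

# Made-up sample documents.
SAMPLE_DOCS = {
    "a.doc": "enterprise search basics",
    "b.doc": "federated search overview",
    "c.doc": "security policy",
    "d.doc": "search relevancy tuning",
}
```

Whether the documents live in one shard or three, `search_all` returns the same answer; only the amount of data each shard has to hold changes. That is the property that lets an engine grow by adding machines.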
Federated search is the practice of having the central search engine not do all the work itself, instead "outsourcing" the user's query to other search engines and then combining their results with its own.
Vendors don't always highlight this feature. Many search licenses have a component related to the number of indexed documents, total size of indexed content, and so forth. When a search is deferred to another engine, the documents in that engine's index don't count toward the license. We suspect that, in general, this is why federated search isn't pushed more heavily by vendors.
One final note here on federated search, which is itself quite a broad topic, is that Enterprise Federated Search has more intense requirements than the general federated search demos that many vendors perform. Federated Enterprise Search needs to maintain document-level security as searches are passed to other engines. This may involve mapping user credentials from one security system or domain to another. This is something that generic federated demos, which often just show combining results from two or three public web portals, don't address. There are other in-depth technical issues with enterprise-class federated search as well. Our advice is generally to go with a solution that is extensible via some type of API, so that new and unusual business requirements can be accommodated. ESP does offer such an API.
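As a rough illustration of the credential-mapping point, the Python sketch below passes a query to several engines, translating the user's identity for each one first. Everything in it is hypothetical: it does not use the ESP API, the engines are plain functions over made-up data, and the identity map is a simple dictionary. It also merges results by raw score, which assumes the engines score on a comparable scale; in practice they rarely do, and merging is one of the harder problems in federation.

```python
def make_engine(docs):
    """Build a toy engine from (title, text, allowed_users) tuples."""
    def engine(query, remote_user):
        term = query.lower()
        results = []
        for title, text, allowed_users in docs:
            hits = text.lower().split().count(term)
            # Document-level security is enforced by the remote engine itself.
            if hits and remote_user in allowed_users:
                results.append((float(hits), title))
        return results
    return engine

def federated_search(query, user, engines, identity_map):
    """Send the query to each engine under the user's mapped identity."""
    merged = []
    for name, engine in engines.items():
        remote_user = identity_map.get((name, user))
        if remote_user is None:
            continue  # no mapped identity: skip rather than search unsecured
        for score, title in engine(query, remote_user):
            merged.append((score, name, title))
    merged.sort(key=lambda hit: hit[0], reverse=True)
    return merged

# Made-up engines, documents, and identities.
ENGINES = {
    "intranet": make_engine([
        ("Q3 Budget", "budget budget forecast", {"alice@corp"}),
        ("Lunch Menu", "menu", {"alice@corp", "bob@corp"}),
    ]),
    "wiki": make_engine([
        ("Budget FAQ", "budget questions", {"alice@wiki", "bob@wiki"}),
    ]),
}
IDENTITY_MAP = {
    ("intranet", "alice"): "alice@corp",
    ("wiki", "alice"): "alice@wiki",
    ("wiki", "bob"): "bob@wiki",
}
```

In this example, "alice" sees budget results from both engines, while "bob", who has no mapped intranet identity, sees only the wiki result. Skipping an engine when no mapping exists is a deliberately conservative design choice; the alternative, searching anonymously, risks leaking documents the user should not see.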
MICROSOFT'S 2010 SEARCH TECHNOLOGY ROAD MAP
In short, you'll have a number of good choices:
* Entry-level — Search Server 2010 Express
* Midlevel — SharePoint Server 2010
* High-end — FAST Search for SharePoint 2010
* Multi-platform — The existing FAST ESP product line
* And, of course, the ancillary search engines embedded in various desktop applications and OSs
Microsoft is also positioning these products according to employee- versus customer-facing uses, although these are not hard and fast rules.
Microsoft will continue to embed search in specific products such as Windows and Office applications. For server-based search, however, Microsoft will offer different products designed for customer- and employee-facing applications. Customer-facing applications will include site search and eCommerce, where engaging search experiences drive revenue (hard ROI).
SharePoint and ESP will also target employee-facing applications, helping process vast amounts of information so that employees can get things done efficiently and effectively.
"With SharePoint Server 2010, Microsoft has made a major leap forward in Enterprise Search. This includes a range of choices — since great search is not a 'one-size-fits-all' endeavor." Microsoft, 2009
* Entry-level — Search Server 2010 Express is a free, downloadable standalone search offering. It incorporates many enhancements over its predecessor, Search Server 2008 Express.
* Infrastructure — SharePoint Server 2010 includes a robust search capability out of the box with many improvements from the previous version.
* High-End — Along with SharePoint Server 2010, a new product, FAST Search for SharePoint 2010, is being introduced; it uses technology from a strategic acquisition of FAST, an industry-leading search technology company.
* Multi-Platform — Another new offering, FAST Search for Internet Business 2010, is being introduced. This expands the FAST ESP product, and adds new modules for content and query processing. This offering is available for Linux as well as Windows.
The introduction of FAST Search for SharePoint provides a new choice: best-in-market Enterprise Search capabilities (based on FAST's premier search product, FAST ESP), closely integrated with SharePoint, with the TCO and ecosystem of Microsoft.
CATEGORIZING YOUR ORGANIZATION'S USE OF SEARCH — EXAMPLES
Everybody who writes about search claims that it's core, but really, how core is it? Although we also believe it's incredibly important, it's clear that it's more important to some organizations than others. The reality is there's a spectrum here. Since no two companies are exactly alike, their use of search will never be exactly the same. However, honestly evaluating your company's use of search might help with decision making down the line.
Instead of spouting all kinds of abstract rules, let's dive into some concrete examples.
Excerpted from Professional Microsoft Search by Mark Bennett, Jeff Fried, Miles Kehoe, and Natalya Voskresenskaya. Copyright © 2010 by John Wiley & Sons, Ltd. Excerpted by permission of John Wiley & Sons. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.