SharePoint Server 2010 Enterprise Content Management
By Todd Kitta, Brett Grego, Chris Caplinger, and Russ Houberg
John Wiley & Sons
Copyright © 2011 John Wiley & Sons, Ltd
All rights reserved.
Chapter One
What Is Enterprise Content Management?
WHAT'S IN THIS CHAPTER?
* Defining ECM as used by this book
* Gaining a historical perspective of ECM
* Defining the components of an ECM system
Considering that this is a book both by and for architects and developers, devoting an entire chapter to talking about the enterprise content management (ECM) industry and trying to define it, rather than just jumping into the bits and bytes that you probably bought the book for, might seem strange. However, by introducing ECM as part of an industry, instead of describing how the SharePoint world perceives it, we hope to provide a perspective that wouldn't otherwise be possible if you make your living inside the SharePoint ecosystem.
ECM, within or outside of the SharePoint world, seems to be a much-abused abbreviation used to describe a variety of different technologies. Of course, people often adopt new or existing terms, applying their own twist to the original meaning, and this is certainly the case with ECM. The difficult part is determining which meaning is actually correct. Sometimes even the words representing the initials are changed. For example, in the halls of our own company, sometimes "electronic" is used instead of "enterprise." In other cases, ECM is confused with specific technologies that are part of it, such as DMS (Document Management System), IMS (Image Management System) or WCM (Web Content Management).
Clearly, ECM means a lot of different things to a variety of people. There is no doubt that some readers of this book will think something is missing from the definition, while other readers will find something included that does not fall into their own definition. That being said, this chapter introduces ECM not necessarily from a SharePoint perspective, but from a historical perspective; then it provides an overview of the components of an ECM system. You can skip this information, but we believe it is important to clarify the problems we are trying to solve, rather than just write code based on our own assumptions.
INTRODUCTION TO ECM
The "content" aspect of enterprise content management can refer to all kinds of sources, including electronic documents, scanned images, e-mail, and web pages.
This book uses the definition of ECM from the Association for Information and Image Management (AIIM) International, which can be found on their website at www.aiim.org:
Enterprise Content Management (ECM) is the strategies, methods, and tools used to capture, manage, store, preserve, and deliver content and documents related to organizational processes. ECM tools and strategies allow the management of an organization's unstructured information, wherever that information exists.
As this definition states, ECM is not really a noun. That is, it's not something as simple as an e-mail system or a device like a scanner, but rather an entire industry for capturing and managing just about any type of content. The key to the definition is that this content is related to organizational processes, which discounts information that is simply created but never used.
Moreover, ECM is meaningless without the tools that accompany it. You might say that the tools that solve your content problem also define it. This idea is explored in the next section, and hopefully clarified by a short history of a few of the technologies involved.
A HISTORICAL PERSPECTIVE
Although the term ECM is relatively new, many of the components that make up an ECM system started appearing in the 1970s. The world of information systems was vastly different 30 to 40 years ago. The Internet as we know it did not exist, the cost to store data was astronomical compared to today, server processing power was a mere fraction of what it is today, and desktop computers didn't even exist.
The history of ECM can be traced back to several technologies that first stored and managed electronic content: document imaging, electronic document management, computer output to laser disc (COLD), and of course workflow, which formed the business processes.
As evidenced by the first systems to take the management and processing of documents seriously, paper was one of the first drivers. These systems were often referred to as electronic document management or document imaging systems. By scanning paper and storing it as electronic documents, organizations found a quick return on investment in several ways:
* It reduced the square footage needed to store paper.
* It resulted in faster execution of paper-based processes by electronic routing.
* It eliminated the time it took to reproduce lost documents.
* It reduced overhead because paper documents could be retrieved electronically.
In addition to a reduction in manpower, there were other benefits to storing paper electronically, namely security and risk benefits, which preceded regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and Sarbanes-Oxley by more than a decade. Some of these included the following:
* Password protection of documents
* Enterprise security restraint brought about by secure networks
* Management of records needed for legal holds
* Management of the document life cycle, such as retention periods
* Audit information about the document life cycle and requests about the document
The first document imaging systems for commercial consumption became available in the early 1980s, and they quickly started to replace the previous technology for removing paper from organizations, which was microfiche. Billions of documents were stored on microfiche, but indexes and location data were often stored in databases. Conversions of these systems to document imaging are surely still being handled today.
The ability to scan existing paper documents in order to create electronic documents, as discussed in the next section, led to the vision of a "paperless office," a commonly used phrase by the end of the century. Of course, this lofty and often pursued goal of a paperless office has yet to materialize, and paper is still the original driver behind many business processes. As shown in Figure 1-1, focusing on paper is a good starting point to quickly begin realizing the benefits of an ECM system.
The invention of computer-based word processors (in the 1970s) created the need for a way to store and quickly retrieve these documents. Electronic documents share similarities with document imaging systems, yet they are unique in that they are typically dynamic; that is, they often require ongoing modification, whereas scanning paper was typically performed for archiving purposes.
The first electronic documents were created through word processing software, driven in the late 1970s by WordStar and WordPerfect. Although the former has been abandoned, WordPerfect still exists today and is part of an office suite from Corel.
Soon after personal computers and electronic word processors hit the market, electronic spreadsheets became available, beginning with VisiCalc, followed by Lotus 1-2-3 and eventually Microsoft Excel. Spreadsheet documents are now as commonplace as word processing documents.
Today, electronic documents exist in countless types and formats, ranging from simple ASCII text files to complex binary structures.
COLD/Enterprise Report Management
The widespread use of computers, beginning with the large mainframes, resulted in an unprecedented use of paper. Early computers all over the world started producing reports, typically on what is known as green bar paper. As the need for information from both mainframe and mini computers grew, so did the need for computer-generated reports. Necessary at first because structured methods for viewing data electronically did not exist, this excessive use of paper continued to plague organizations into the 1990s and even into this millennium.
Out of this problem grew a solution coined computer output to laser disc (COLD). Instead of generating paper, these reports could be handled in a type of electronic content management system, typically storing the ASCII data and rendering it onto monitors. These systems enabled not only search and retrieval of the reports, but also the addition of annotations, and of course printing of the documents when viewing them on a monitor was not adequate.
The term COLD was eventually replaced by enterprise report management when magnetic storage replaced the early optical storage systems.
Business Process Management/Workflow
Storing content electronically was a great step forward, but moving content to a digital medium quickly created the need to put that content in front of the right user at the right time.
The first business process management (BPM) systems, launched in the mid-1980s and called workflow systems, were created by the same companies that brought document imaging systems to market. These early systems were far less complex than the BPM systems used today, however, as they primarily enabled content to be put into queues to be processed by the same workers who had processed the paper.
It was almost 10 years later when the first graphical components became available for creating complex workflow maps. This was the beginning of BPM as we know it today, which enables organizations to create, store, and modify business processes.
In order to understand ECM, it is necessary to understand the common components that comprise such a system. The following sections provide a brief overview of these components, which, like the definition for ECM provided earlier, have been defined by AIIM.
Capture is the process of gathering the data, regardless of the source, including classification, indexing (sometimes called tagging), and rendition. These tasks are required before storage is possible, in order to understand what type of content is being managed, which keywords will be used to search for the content, and to ensure that the content is in a form that can be easily retrieved and viewed later.
Paper is still the primary driver of ECM. The reason is simple: Because of the volume of paper that most companies need to handle, efficiently managing that paper can provide the greatest return on investment. Figure 1-2 gives you some idea of the cost of paper in an enterprise.
Paper capture is primarily done by using document scanners specifically built for the purpose. These scanners can capture both small and large batches of paper. Paper documents are typically divided into three categories: structured, semi-structured, and unstructured.
Structured documents typically represent forms such as tax documents, applications, or other preexisting forms. Because they are always the same, these documents are usually the easiest to automatically extract information from. Technology such as Zonal OCR can be easily applied to structured forms because the key data always exists in the same place.
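To make the idea of fixed zones concrete, here is a minimal sketch in Python. It assumes an OCR engine has already produced recognized words with page coordinates; the Word and Zone types, the field names, and the coordinates are all hypothetical and chosen only for illustration. It is not the API of any particular OCR or capture product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One recognized word and the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Zone:
    """A named rectangle on the form where a field is always printed."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        # A word belongs to the zone if its top-left corner falls inside it.
        return (self.left <= word.x < self.right
                and self.top <= word.y < self.bottom)


def extract_fields(words, zones):
    """Return a dict mapping each zone name to the text found inside it."""
    fields = {}
    for zone in zones:
        hits = [w for w in words if zone.contains(w)]
        hits.sort(key=lambda w: (w.y, w.x))  # reading order: top-down, left-right
        fields[zone.name] = " ".join(w.text for w in hits)
    return fields


# Hypothetical OCR output for one page of a structured form.
SAMPLE_WORDS = [
    Word("Application", 20, 10),
    Word("Form", 110, 10),
    Word("Jane", 100, 50),
    Word("Doe", 140, 50),
    Word("2010-04-15", 400, 50),
]

# Hypothetical zone layout; the same layout applies to every copy of the form.
SAMPLE_ZONES = [
    Zone("applicant_name", 90, 40, 300, 70),
    Zone("date", 390, 40, 500, 70),
]
```

Calling extract_fields(SAMPLE_WORDS, SAMPLE_ZONES) yields the applicant name and date without any per-document logic. This approach works only because the key data always sits in the same place on the form.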
Semi-structured documents are similar to structured documents, but they are different enough that structured zones can no longer be used to extract data. A common type of semi-structured document is the invoice. Most invoices are similar enough to be recognized as such, but each company designs its invoice with enough nuances to distinguish it from others, so these documents require either manual manipulation of the data or more intelligent automation.
Unstructured content represents the majority of the information in the average corporate enterprise. Almost all human correspondence is a good example of unstructured content. Although this content ends up being stored as the same content type in each company, on a per-page basis the individual documents have very little in common. Manual indexing is typically required for this type of data.
Developing applications to handle all the different types of paper that may come into an enterprise is a daunting task. Fortunately, toolkits are available to drive most document scanners on the market today, and these are normally compliant with either (or both) the TWAIN and ISIS driver standards. Although the physical process of scanning a piece of paper is simple, building a good process for either manually entering information or automatically extracting it is difficult, and probably not cost effective for custom applications.
Because this book is about ECM and (Microsoft) SharePoint, it focuses on the most common type of electronic documents that need to be managed: (Microsoft) Office documents. These documents are created using word processing software, spreadsheet software, presentation software, and so on. These documents are typically pre-classified on creation, as they frequently start from a template; therefore, extracting data for searching is often overlooked. Although each word from the document can be added to search indexers, it makes more sense to use specific keywords to identify the document. Sometimes pre-identified form fields are used, but often the most important data is keyed by hand before sending the document to storage.
Capturing e-mails into an ECM system is becoming an increasingly common scenario. E-mail is often used to drive a business process, as it is becoming an acceptable form of correspondence in most organizations. Although some data can be indexed automatically, such as the sender, receiver, and subject, as shown in Figure 1-3, it can be difficult to extract the useful information contained in the body of the message, and manual intervention is usually required.
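The automatic part of e-mail indexing can be sketched with Python's standard email package. The sample message, the index_email function, and the dictionary keys below are assumptions made for illustration; they do not represent how SharePoint or any specific ECM product indexes mail.

```python
from email import message_from_string
from email.utils import getaddresses

# A hypothetical raw message; real input would come from a mail store.
RAW_MESSAGE = """\
From: Alice Example <alice@example.com>
To: Bob Example <bob@example.com>, carol@example.com
Subject: Invoice 1042 attached

Please see the attached invoice.
"""


def index_email(raw):
    """Extract the header values that can be indexed without human help."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []))
    return {
        "sender": senders[0][1] if senders else None,
        "recipients": [address for _name, address in recipients],
        "subject": msg.get("Subject", ""),
    }
```

The headers are easy to index because they are structured. The body ("Please see the attached invoice.") carries no machine-readable invoice number or amount, which is why manual intervention is usually required for it.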
E-mail attachments are often more important than the e-mail message itself. Although Microsoft Outlook and other e-mail clients are improving ECM integration, extracting the attachment and exporting it to an ECM system usually requires specialized software in order to properly tag the content with searchable data.
As mentioned earlier, technology originally known as computer output to laser disc (COLD) and later as enterprise report management enables computer reports to be parsed into electronic files. Classifying and indexing these documents is typically automatic because they are in a form that is very structured.
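Because report output is so regular, automatic indexing can be sketched in a few lines of Python. The example assumes a plain-text report stream in which a form feed character separates pages and each page carries an account number and date on a fixed line. The report layout, the INDEX_PATTERN expression, and the index_report function are hypothetical.

```python
import re

# Hypothetical mainframe-style report: one statement per page,
# pages separated by a form feed character.
REPORT = (
    "ACCOUNT STATEMENT          PAGE 1\n"
    "ACCOUNT: 10045   DATE: 2010-03-31\n"
    "BALANCE: 1,250.00\n"
    "\f"
    "ACCOUNT STATEMENT          PAGE 1\n"
    "ACCOUNT: 10078   DATE: 2010-03-31\n"
    "BALANCE: 310.75\n"
)

# The index fields always appear in the same textual pattern.
INDEX_PATTERN = re.compile(
    r"ACCOUNT:\s*(?P<account>\d+)\s+DATE:\s*(?P<date>\d{4}-\d{2}-\d{2})"
)


def index_report(report):
    """Split a report into pages and pull index values from each page."""
    entries = []
    for page_number, page in enumerate(report.split("\f"), start=1):
        match = INDEX_PATTERN.search(page)
        if match is None:
            continue  # skip pages that carry no index data
        entries.append({
            "page": page_number,
            "account": match["account"],
            "date": match["date"],
            "text": page,
        })
    return entries
```

Each entry keeps the original page text alongside its index values, so a page can be retrieved by account number and then rendered on screen as it would have been printed.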
Like the handling of paper, building a system for handling enterprise reports is most likely not cost effective when you compare the needed functionality versus the difficulty of obtaining it. Consider being able to read EDI or other electronic streams and extract the necessary data from them. Also, although data storage may not be an issue, you must consider how it will be displayed to users in a readable format.
It was initially believed that the goal of the paperless office would be achieved with the help of electronic forms. After all, the form templates provided in many applications, such as InfoPath, enable users to fill out preexisting fields and submit them directly to a content management system. With the type of content already known and the data being put into electronic form as it was gathered, it stood to reason that the paper forms could gradually disappear. However, human habits die hard. It may take another generation of computer users, who have been raised from birth with computers and who use them for everything from social networking to bill-paying, to realize the truly paperless office.
Although the most common types of capture have been identified here, there are many other possible data sources and data types. Multimedia, XML, and EDI are other well-known data formats that can arrive from many different sources. Indeed, just about any type of data can be consumed in an ECM system.
Store and Preserve
The store and preserve components of an ECM system are very similar; storage traditionally pertains to the temporary location of content, whereas preservation refers to long-term storage. In the past these were separated because online storage was costly. Content was usually stored in the temporary location only during its active life cycle, when it was frequently accessed as part of the business process. Once the active life cycle was complete, content would move to long-term storage, known as offline storage or nearline storage, which was much less expensive. The term "offline" reflects the fact that the content wasn't accessible without human intervention; the term "nearline" typically refers to optical discs that were brought online automatically, such as in the case of a jukebox. The following sections describe both the software and hardware components of storage and preservation of content.
Excerpted from SharePoint Server 2010 Enterprise Content Management by Todd Kitta, Brett Grego, Chris Caplinger, and Russ Houberg. Copyright © 2011 by John Wiley & Sons, Ltd. Excerpted by permission of John Wiley & Sons. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.