Wednesday, November 07, 2007
GovernmentDocs.org provides enterprise search capabilities to users who want to perform sophisticated context-specific searches to find targeted results within the large set of government documents. The search is driven by optical character recognition (OCR) technology, which translates digital images of text to machine-editable text that is stored using Extensible Markup Language (XML). All documents uploaded to GovernmentDocs.org are OCRed and the resulting text is then indexed for retrieval.
GovernmentDocs.org’s search engine, powered by ZyLab, was developed specifically for use by law firms for e-discovery and to assist government record management. E-discovery refers to the discovery of electronically stored information for use as evidence in a civil or criminal legal case. Therefore, GovernmentDocs.org is powered by a sophisticated search engine and has numerous techniques for customizing searches. These techniques and the associated syntax are discussed in Section II of this manual.
There are two functional areas that also provide end-users with a significant advantage when trying to find specific information:
When search engines index the Internet, they typically only want to find the best sites, or the most popular ones, as opposed to finding all potentially relevant sites. Thus search platforms that are derived from these Internet search engines focus on 100% precision and use deeply embedded relevance ranking algorithms, which can be based upon, for instance, popularity ranking (in traffic and in links).
In many legal, intelligence or investigative environments, though, “popular” hits aren’t always desired; for example, consider the complex financial transactions in a money laundering case. Users in this instance cannot miss a document that contains a “smoking gun” or provides critical details that could influence the case. All potentially relevant hits must be found, meaning 100% recall is needed, which is difficult to achieve if an engine is optimized for precision across an implementation. GovernmentDocs.org’s search engine returns all relevant results and then arranges the set of retrieved records so that those most likely to be relevant to your request are shown to you first. There are a number of different factors that GovernmentDocs.org uses to determine relevance, such as density, frequency and breadth of match.
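The trade-off between precision and recall described above can be made concrete with a small calculation. The following is only an illustrative sketch: the document identifiers and the `precision_recall` helper are hypothetical and are not part of GovernmentDocs.org or ZyLAB.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = share of retrieved documents that are relevant
    recall    = share of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case file: four documents actually matter.
relevant = {"doc1", "doc2", "doc3", "doc4"}

# An engine tuned for precision returns only its most confident hits.
narrow = ["doc1", "doc2"]
# An engine tuned for recall returns every candidate, including some noise.
broad = ["doc1", "doc2", "doc3", "doc4", "doc7", "doc9"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5): nothing wrong, but half is missed
print(precision_recall(broad, relevant))   # precision about 0.67, recall 1.0
```

In the second case nothing relevant is missed, and relevance ranking is then what keeps the extra hits from getting in the way.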
Hit highlighting is the search engine’s ability to mark each occurrence of a search term on a page in yellow (similar to a yellow highlighter). This feature can be used as a quick mechanism to locate the terms and determine whether a document is relevant or not. It is an essential feature of GovernmentDocs.org’s search engine.
It is also important to be able to move easily between pages within the search results set. GovernmentDocs.org provides an easy mechanism to navigate to the next or previous page returned by the end user’s search without going back to the search results page.
Please review this section to learn more about the techniques and syntax that can be used to control the results provided by GovernmentDocs.org’s sophisticated search engine. Almost any kind of search can be carried out, from exploratory searches that provide a broad sense of the available information to very precise searches.
| Search statement | Example of query | Results |
|---|---|---|
| Content word | Chicago | chicago |
| Content phrase | chicago cubs | chicago cubs |
| Content phrase (with one or more noise words) | Billy the Kid | Billy the Kid |
| Two words | cleveland OR detroit | cleveland; detroit; cleveland, detroit |
| Two words | cleveland AND detroit | cleveland, detroit |
Use Fuzzy search to find all occurrences of a word, including the ones that were not recognized correctly by the OCR engine. You can choose from 4 fuzzy degrees (the degree of closeness to your query). Fuzzy degree 2 is recommended for normal text searches; it allows for OCR mistakes such as broken and joined characters. Set Fuzzy degree 3 or 4 if you are searching for long words.
| Example query | Fuzzy degree | Results |
|---|---|---|
| computer | 0 | computer |
| computer | 1 | computer, commuter, compute, computter |
| computer | 2 | computer, commuter, compute, computter, computw, cumpoter, comput |
| computer | 3 | computer, commuter, compute, computter, computw, cumpoter, ounputer, cumpotor, ocnputter, compu |
| computer | 4 | computer, commuter, compute, computter, computw, cumpoter, ounputer, cumpotor, ocnputter, ccnpoter, oompotor, cunnputcr, comp |
To prevent the retrieval of irrelevant shorter words, the fuzzy degree that is actually applied is limited by the length of the search term. For example, if the Fuzzy Degree is 4 and the search term is six characters long, the actual Degree of Fuzzy will be 0.5 × 6 = 3 and not 4.
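The length limit on the fuzzy degree can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the engine's actual algorithm: it assumes that a fuzzy degree corresponds to a maximum Levenshtein edit distance and that half the term length is rounded down. The function names are hypothetical.

```python
def effective_fuzzy_degree(term, requested_degree):
    # Cap the requested degree at half the term length.
    # Rounding down for odd lengths is an assumption.
    return min(requested_degree, len(term) // 2)


def edit_distance(a, b):
    # Classic Levenshtein distance, computed row by row.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, requested_degree):
    # Keep candidates whose distance to the term is within the capped degree.
    limit = effective_fuzzy_degree(term, requested_degree)
    return [word for word in candidates if edit_distance(term, word) <= limit]


print(effective_fuzzy_degree("search", 4))  # 3, as in the six-character example
print(fuzzy_matches("computer", ["computer", "commuter", "compute", "comput"], 1))
```

Plain edit distance will not necessarily reproduce every row of the table above, since an OCR-aware engine may weigh character confusions differently.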
| Example of query | Results |
|---|---|
| b?rn | born, barn, burn |
| ?andy | candy, dandy, sandy |
| sh??e | shore, shade |
| 60? | 600, 601, 602, 603, 604, 605, 606, 607, 608, 609 |
| *most | most, almost |
| auto* | auto, automobile, automotive, autobiography, autocracy, autograph |
| auto AND automo* | auto, automobile, automotive |
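The wildcard behavior shown in the table (`?` for exactly one character, `*` for zero or more characters) can be approximated by translating the pattern into a regular expression. This is a sketch for illustration only; `wildcard_to_regex` and `wildcard_matches` are hypothetical helpers, and case-insensitive matching is an assumption.

```python
import re


def wildcard_to_regex(pattern):
    # Translate the query wildcards into an equivalent regular expression.
    parts = []
    for char in pattern:
        if char == "?":
            parts.append(".")      # exactly one character
        elif char == "*":
            parts.append(".*")     # zero or more characters
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_matches(pattern, words):
    # fullmatch ensures the whole word fits the pattern, not just a part of it.
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]


print(wildcard_matches("b?rn", ["born", "barn", "burn", "brn", "bourn"]))
print(wildcard_matches("*most", ["most", "almost", "mostly"]))
```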
Boolean operators define the relationships between words or groups of words.
| Boolean operator | Recommended use | Example of query | Results |
|---|---|---|---|
| OR | Broaden your search and look for terms that have similar meaning, or refer to similar subjects. | car OR transportation | car; transportation; car, transportation |
| AND | Narrow your search and look for terms that have different meanings. | new england AND north dakota | new england, north dakota |
| NOT | Narrow your search and look for documents that do not contain the word after NOT. | NOT cars | bikes; boats; anything but cars |
| AND NOT | Narrow your search by excluding a term that is often connected to your search term. | leaf AND NOT tree | leaf |
| AND NOT | | used cars AND NOT cars | No results |
| AND NOT | | (cars) AND NOT used cars | cars |
| OR NOT | | used cars OR NOT cars | No results |
| OR NOT | | (cars) OR NOT used cars | cars, and all documents that do not contain the phrase "used cars". |
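One way to reason about OR, AND, NOT and AND NOT is as set operations on the documents that contain each term. The sketch below uses a tiny hypothetical document collection and a hypothetical `docs_containing` helper; it illustrates the logic only and does not describe how the search engine evaluates queries internally.

```python
# Hypothetical three-document collection.
documents = {
    1: "new cars for sale",
    2: "used cars and bikes",
    3: "bikes and boats",
}


def docs_containing(phrase):
    # Pad with spaces so that only whole words or phrases match.
    padded = f" {phrase.lower()} "
    return {doc_id for doc_id, text in documents.items()
            if padded in f" {text.lower()} "}


all_docs = set(documents)
cars = docs_containing("cars")            # {1, 2}
used_cars = docs_containing("used cars")  # {2}
bikes = docs_containing("bikes")          # {2, 3}

print(cars | bikes)      # cars OR bikes: union
print(cars & bikes)      # cars AND bikes: intersection
print(all_docs - cars)   # NOT cars: complement
print(cars - used_cars)  # (cars) AND NOT used cars
print(used_cars - cars)  # used cars AND NOT cars: empty, hence "No results"
```

Every document containing the phrase "used cars" also contains the word "cars", which is why excluding "cars" from "used cars" leaves nothing.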
Use positional operators to ensure that your search terms are contextually related. This is especially useful when searching long documents.
| Positional operator | Recommended use | Example of query | Results |
|---|---|---|---|
| WITHIN: W/n | Limit your search to words that appear within a defined range of n words (max. 16382) in either direction. | client W/6 complaint | "The client was determined to file a complaint." "A complaint was filed by his client." |
| | | blue sky w/10 green grass w/10 clear water | "From high in the blue sky, he could see the green grass and the sparkling clear water of the sea." |
| W/n/EOS (sentence) | | Minnesota W/3/EOS Maine | Minnesota appears within 3 sentences of Maine, and vice versa. |
| W/n/EOP (paragraph) | | Minnesota W/3/EOP Maine | Minnesota appears within 3 paragraphs of Maine, and vice versa. |
| W/n/EOG (page) | | Minnesota W/3/EOG Maine | Minnesota appears within 3 pages of Maine, and vice versa. |
| Precedence: P/n | Find terms in a specific order. | education P/100 fitness | Education precedes fitness within 100 words. |
| TO | Search for occurrences of a term falling between two other terms. | blue TO green {red} | "Blue and red and green." Only red is highlighted; blue and green function as start and end delimiters. |
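The W/n and P/n operators can be pictured as comparisons between word positions. The following sketch is an approximation under stated assumptions: it treats a text as a list of lowercase words and counts the distance as the difference between word positions, which may not match the engine's exact counting rules. The `within` and `precedes` functions are hypothetical.

```python
import re


def tokenize(text):
    # Split into lowercase words, ignoring punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(words, term):
    return [i for i, word in enumerate(words) if word == term.lower()]


def within(text, first, second, n):
    """Rough equivalent of `first W/n second`: either order, at most n words apart."""
    words = tokenize(text)
    return any(abs(i - j) <= n
               for i in positions(words, first)
               for j in positions(words, second))


def precedes(text, first, second, n):
    """Rough equivalent of `first P/n second`: first comes before second, at most n words apart."""
    words = tokenize(text)
    return any(0 < j - i <= n
               for i in positions(words, first)
               for j in positions(words, second))


sentence_a = "The client was determined to file a complaint."
sentence_b = "A complaint was filed by his client."

print(within(sentence_a, "client", "complaint", 6))    # True
print(within(sentence_b, "client", "complaint", 6))    # True
print(precedes(sentence_b, "client", "complaint", 6))  # False: order is reversed
```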
When you use two or more operators, the order in which these operators are executed is determined by operator precedence. You can override this order by using parentheses (grouped terms). Content within parentheses is interpreted as one unit. Always use parentheses in complex search statements (more than two operators).
The following operator precedence is executed:
You can use these math operators in number range searches:
| Example of query | Results |
|---|---|
| "60615" | 60615 |
| 60615 | 60615, 606154, 060615, etc. |
| >=65 w/10 social security | Number 65 or higher within 10 words from social security. |
| > 21 AND high school graduate | 22, high school graduate; 23, high school graduate; etc. |
| >1 : <10 | All values between 1 and 10, for example 1,000000000001; 2; 3,333333; 9,9999999999999. Searches of this type take time to execute. A lack of system resources may cause the search to error out. |
| <> 5 (identical to NOT 5) | Documents without value 5 are retrieved. |
Use Quorum operators to search for a specified number of terms within a search statement.
n of {term, term, ...}
| Example of query | Results |
|---|---|
| 2 of {history, English, social studies, French, Dutch, German} | history, social studies; Dutch, German; etc. |
| 1 of {blue, green, red} (equals blue OR green OR red) | blue; green; red |
| 3 of {blue, green, red} (equals blue AND green AND red) | blue, green, red |
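A quorum search can be thought of as counting how many of the listed terms occur in a document and comparing that count with n. The sketch below is illustrative only; `quorum_match` is a hypothetical function, and whole-word, case-insensitive matching is an assumption.

```python
import re


def quorum_match(text, n, terms):
    """Return True if at least n of the given terms occur in the text."""
    found = 0
    for term in terms:
        # \b anchors keep "red" from matching inside "hundred".
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found += 1
    return found >= n


subjects = ["history", "English", "social studies", "French", "Dutch", "German"]
colors = ["blue", "green", "red"]

print(quorum_match("She teaches history and social studies.", 2, subjects))  # True
print(quorum_match("A blue door.", 1, colors))                # True, like blue OR green OR red
print(quorum_match("A blue and green flag.", 3, colors))      # False, red is missing
```

With n set to 1 the check behaves like OR across all terms, and with n equal to the number of terms it behaves like AND, matching the last two rows of the table.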
Use Separators to limit a search to a physically defined range of a text file. Separators are very useful when combined with the TO operator.
You can use these separators:
| Example of query | Results |
|---|---|
| experience TO EOP {(driver or chauffeur) and >= 3} | Locates resumes of persons with a minimum of three years' experience as a driver. |
| EOP TO EOP {economic and policy} | Locates a single paragraph that includes both economic and policy. |
| "EOG" | EOG |
| EOG | Retrieves all files with an End Of Page marker, which is, of course, every file. |
i The Evolution of Enterprise Search: Specialist Applications Move to the Top of the "Food Chain", By Dr. Johannes C. Scholtes, President and CEO, ZyLAB North America LLC, KMWorld, May 2007
ii ZyLab Standard User Manual ZyIMAGE 5.0, © Copyright 1997-2007 ZyLAB Technologies