GovernmentDocs.org

User Manual: Search

Wednesday, November 07, 2007



Table of Contents

I. Search Capabilities
II. Search Techniques and Syntax

I. Search Capabilities

GovernmentDocs.org provides enterprise search capabilities for users who want to run sophisticated, context-specific searches and find targeted results within a large set of government documents. The search is driven by optical character recognition (OCR) technology, which translates digital images of text into machine-editable text that is stored as Extensible Markup Language (XML). Every document uploaded to GovernmentDocs.org is processed with OCR, and the resulting text is then indexed for retrieval.


GovernmentDocs.org’s search engine, powered by ZyLAB, was developed specifically for use by law firms for e-discovery and to assist government record management. E-discovery refers to the discovery of electronically stored information for use as evidence in a civil or criminal legal case. Because of this background, the engine offers numerous techniques for customizing searches. These techniques and the associated syntax are discussed in Section II of this manual.


There are two functional areas that also provide end-users with a significant advantage when trying to find specific information:



Relevance and Ranking i

When search engines index the Internet, they typically only want to find the best sites, or the most popular ones, as opposed to finding all potentially relevant sites. Thus search platforms that are derived from these Internet search engines focus on 100% precision and use deeply embedded relevance ranking algorithms, which can be based upon, for instance, popularity ranking (in traffic and in links).


In many legal, intelligence or investigative environments, though, “popular” hits aren’t always desired; for example, consider the complex financial transactions in a money laundering case. Users in this instance cannot miss a document that contains a “smoking gun” or provides critical details that could influence the case. All potentially relevant hits must be found, meaning 100% recall is needed, which is difficult to achieve if an engine is optimized for precision across an implementation.  GovernmentDocs.org’s search engine returns all relevant results and then arranges the set of retrieved records so that those most likely to be relevant to your request are shown to you first. There are a number of different factors that GovernmentDocs.org uses to determine relevance, such as density, frequency and breadth of match.



Hit Highlighting and Hit Navigation

Hit highlighting is the search engine’s ability to mark each occurrence of the search terms on a page in yellow, much like a yellow highlighter. This feature offers a quick way to locate the terms and judge whether a document is relevant. It is an essential feature of GovernmentDocs.org’s search engine.


It is also important to be able to move easily between pages within the search result set. GovernmentDocs.org provides an easy mechanism to navigate to the next or previous page returned by the end user’s search without going back to the search results page.



II. Search Techniques and Syntax ii

Please review this section to learn more about the techniques and syntax that can be used to control the results provided by GovernmentDocs.org’s search engine. Almost any kind of search can be carried out, from explorative searches that give a broad sense of the available information to very precise searches.


Content Words and Phrases
Search statement | Example of query | Results
Content word | Chicago | chicago
Content phrase | chicago cubs | chicago cubs
Content phrase (with one or more noise words) | Billy the Kid | Billy the Kid
Two words | cleveland OR detroit | cleveland; detroit; cleveland, detroit
Two words | cleveland AND detroit | cleveland, detroit

Fuzzy Searches

Use Fuzzy search to find all occurrences of a word, including those that the OCR engine did not recognize correctly. You can choose from 4 fuzzy degrees, which set how close a match must be to your query. Fuzzy degree 2 is recommended for normal text searches; it allows for mistakes such as broken and joined characters. Set fuzzy degree 3 or 4 if you search for long words.


Example query | Fuzzy degree | Results
computer | 0 | computer
computer | 1 | computer, commuter, compute, computter
computer | 2 | computer, commuter, compute, computter, computw, cumpoter, comput
computer | 3 | computer, commuter, compute, computter, computw, cumpoter, ounputer, cumpotor, ocnputter, compu
computer | 4 | computer, commuter, compute, computter, computw, cumpoter, ounputer, cumpotor, ocnputter, ccnpoter, oompotor, cunnputcr, comp

Note

The retrieval of irrelevant shorter words is prevented by limiting the fuzzy degree for short search terms. For example, if the fuzzy degree is 4 and the search term is six characters long, the effective fuzzy degree is 0.5 x 6 = 3, not 4.
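
The limit in the note above can be sketched in Python as follows. This is only an illustration: the function name effective_fuzzy_degree is hypothetical, and rounding down for terms with an odd number of characters is an assumption, since the manual documents only the six-character case.

```python
def effective_fuzzy_degree(term: str, requested_degree: int) -> int:
    """Cap the requested fuzzy degree at half the length of the search term.

    Assumption: halves are rounded down for odd-length terms; the manual
    only documents the six-character case (0.5 x 6 = 3).
    """
    cap = len(term) // 2
    return min(requested_degree, cap)


print(effective_fuzzy_degree("search", 4))    # six characters: capped at 3
print(effective_fuzzy_degree("computer", 4))  # eight characters: stays at 4
```

Under this reading, the requested degree only takes full effect for terms that are at least twice as long as the degree.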


Wild Card Searches
Example of query | Results
b?rn | born, barn, burn
?andy | candy, dandy, sandy
sh??e | shore, shade
60? | 600, 601, 602, 603, 604, 605, 606, 607, 608, 609
*most | most, almost
auto* | auto, automobile, automotive, autobiography, autocracy, autograph
auto AND automo* | auto, automobile, automotive
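
To make the wildcard behavior concrete, here is a minimal Python sketch that translates a wildcard query into a regular expression. It assumes that ? stands for exactly one character and * for zero or more characters, which is consistent with the examples in the table but is not confirmed as the engine's internal behavior. The function names and the sample vocabulary are hypothetical.

```python
import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a wildcard query into a compiled regular expression.

    Assumptions: '?' stands for exactly one character and '*' for zero or
    more characters, which is consistent with the examples in the table.
    """
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_matches(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


# Hypothetical vocabulary used only for this illustration.
vocabulary = ["born", "barn", "burn", "brown", "most", "almost", "auto", "automobile"]
print(wildcard_matches("b?rn", vocabulary))   # ['born', 'barn', 'burn']
print(wildcard_matches("*most", vocabulary))  # ['most', 'almost']
print(wildcard_matches("auto*", vocabulary))  # ['auto', 'automobile']
```

Note that "brown" is not returned for b?rn, because ? replaces a single character only.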

Boolean Operators

Boolean operators define the relationships between words or groups of words.


Boolean operator | Recommended use | Example of query | Results
OR | Broaden your search and look for terms that have similar meaning, or refer to similar subjects. | car OR transportation | car; transportation; car, transportation
AND | Narrow your search and look for terms that have different meanings. | new england AND north dakota | new england, north dakota
NOT | Narrow your search and look for documents that do not contain the word after NOT. | NOT cars | bikes; boats; anything but cars
AND NOT | Narrow your search by excluding a term that is often connected to your search. | leaf AND NOT tree | leaf
AND NOT | | used cars AND NOT cars | No results
AND NOT | | (cars) AND NOT used cars | cars
OR NOT | | used cars OR NOT cars | No results
OR NOT | | (cars) OR NOT used cars | cars, and all documents that do not contain the phrase "used cars"
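
One way to picture the basic operators is as set operations on the documents that contain each term. The Python sketch below illustrates this under stated assumptions: the four sample documents and the helper docs_containing are made up for the example, and whole-word matching is a simplification that says nothing about how the engine handles phrases.

```python
# Toy collection: document id -> text. The contents are made up for illustration.
documents = {
    1: "car rental prices",
    2: "public transportation schedule",
    3: "car and public transportation options",
    4: "bike lanes",
}


def docs_containing(term: str) -> set[int]:
    """Return the ids of documents that contain the term as a whole word."""
    return {
        doc_id
        for doc_id, text in documents.items()
        if term.lower() in text.lower().split()
    }


all_docs = set(documents)
car = docs_containing("car")
transportation = docs_containing("transportation")

print(sorted(car | transportation))  # car OR transportation      -> [1, 2, 3]
print(sorted(car & transportation))  # car AND transportation     -> [3]
print(sorted(all_docs - car))        # NOT car                    -> [2, 4]
print(sorted(car - transportation))  # car AND NOT transportation -> [1]
```

OR corresponds to a union, AND to an intersection, and NOT to the complement within the whole collection.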

Positional Operators: Context Related

Use positional operators to ensure that your search terms are contextually related. This is especially useful when searching long documents.


Positional operator | Recommended use | Example of query | Results
WITHIN: W/n | Limit your search to words that appear within a defined range of n words (maximum 16382) in either direction. | client W/6 complaint | "The client was determined to file a complaint." "A complaint was filed by his client."
WITHIN: W/n | | blue sky w/10 green grass w/10 clear water | "From high in the blue sky, he could see the green grass and the sparkling clear water of the sea."
W/n/EOS (sentence) | | Minnesota W/3/EOS Maine | Minnesota appears within 3 sentences of Maine, and vice versa.
W/n/EOP (paragraph) | | Minnesota W/3/EOP Maine | Minnesota appears within 3 paragraphs of Maine, and vice versa.
W/n/EOG (page) | | Minnesota W/3/EOG Maine | Minnesota appears within 3 pages of Maine, and vice versa.
Precedence: P/n | Find terms in a specific order. | education P/100 fitness | Education precedes fitness within 100 words.
TO | Search for occurrences of a term falling between two other terms. | blue TO green {red} | "Blue and red and green." Only red is highlighted; blue and green function as start and end delimiters.
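
The W/n and P/n operators can be illustrated with a short Python sketch that compares word positions. It is a simplified model, not the engine's implementation: the function names are hypothetical, it handles single-word terms only, and it assumes that distance is measured as the difference between word positions, which agrees with the client W/6 complaint examples above.

```python
import re


def positions(tokens: list[str], term: str) -> list[int]:
    """Return every index at which the term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within(text: str, a: str, b: str, n: int) -> bool:
    """Sketch of 'a W/n b': a and b occur at most n words apart, in either order."""
    tokens = re.findall(r"\w+", text.lower())
    return any(
        abs(i - j) <= n
        for i in positions(tokens, a.lower())
        for j in positions(tokens, b.lower())
    )


def precedes(text: str, a: str, b: str, n: int) -> bool:
    """Sketch of 'a P/n b': a occurs before b, at most n words earlier."""
    tokens = re.findall(r"\w+", text.lower())
    return any(
        0 < j - i <= n
        for i in positions(tokens, a.lower())
        for j in positions(tokens, b.lower())
    )


print(within("The client was determined to file a complaint.", "client", "complaint", 6))  # True
print(within("A complaint was filed by his client.", "client", "complaint", 6))            # True
print(precedes("A complaint was filed by his client.", "client", "complaint", 6))          # False
```

The last call returns False because P/n also requires the first term to come before the second.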

Precedence and Parentheses

When you use two or more operators, the order in which these operators are executed is determined by operator precedence. You can override this order by using parentheses (grouped terms). Content within parentheses is interpreted as one unit. Always use parentheses in complex search statements (more than two operators).


The following operator precedence is executed:


Number Range Operator

You can use these math operators in number range searches:


Example of query | Results
"60615" | 60615
60615 | 60615, 606154, 060615, etc.
>=65 w/10 social security | Number 65 or higher within 10 words from social security.
> 21 AND high school graduate | 22, high school graduate; 23, high school graduate; etc.
>1 : <10 | All values between 1 and 10, for example 1,000000000001; 2; 3,333333; 9,9999999999999. Searches of this type take time to execute. A lack of system resources may cause the search to error out.
<> 5 (identical to NOT 5) | Documents without value 5 are retrieved.
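
As an illustration of how the range expressions in the table read, the following Python sketch checks a single number against an expression. It is an assumption-laden model: the function matches_range is hypothetical, it supports only the symbols that appear in the examples above (>=, >, <, <> and the : separator), it treats conditions joined by : as all being required, and it expects a period as the decimal separator.

```python
import operator
import re

# Comparison symbols that appear in the table, mapped to Python comparisons.
COMPARATORS = {
    ">=": operator.ge,
    "<>": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
}


def matches_range(value: float, expression: str) -> bool:
    """Check a number against an expression such as '>1 : <10' or '<> 5'.

    Assumption: conditions separated by ':' must all hold.
    """
    for condition in expression.split(":"):
        match = re.fullmatch(r"\s*(>=|<>|>|<)\s*(-?\d+(?:\.\d+)?)\s*", condition)
        if match is None:
            raise ValueError(f"Unrecognized condition: {condition!r}")
        symbol, bound = match.groups()
        if not COMPARATORS[symbol](value, float(bound)):
            return False
    return True


print(matches_range(3.333333, ">1 : <10"))  # True
print(matches_range(10, ">1 : <10"))        # False
print(matches_range(5, "<> 5"))             # False
print(matches_range(65, ">=65"))            # True
```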

Quorum Operator

Use Quorum operators to search for a specified number of terms within a search statement.


n of {term, term, ...}

Example of query | Results
2 of {history, English, social studies, French, Dutch, German} | history, social studies; Dutch, German; etc.
1 of {blue, green, red} (equals blue OR green OR red) | blue; green; red
3 of {blue, green, red} (equals blue AND green AND red) | blue, green, red
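
The quorum logic can be sketched in a few lines of Python. The function quorum is hypothetical, and it reads "n of" as "at least n of", an assumption that matches the OR and AND equivalences in the table. Terms are found by simple substring matching so that multi-word terms such as "social studies" work, which also means that a short term can match inside a longer word.

```python
def quorum(n: int, terms: list[str], text: str) -> bool:
    """Sketch of 'n of {term, ...}': at least n of the listed terms occur in the text."""
    lowered = text.lower()
    found = sum(1 for term in terms if term.lower() in lowered)
    return found >= n


colors = ["blue", "green", "red"]
sample = "The flag is blue and red."  # made-up text for illustration

print(quorum(1, colors, sample))  # True, like blue OR green OR red
print(quorum(2, colors, sample))  # True, two of the three terms occur
print(quorum(3, colors, sample))  # False, like blue AND green AND red
```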

Separators

Use Separators to limit a search to a physically defined range of a text file. Separators are very useful when combined with the TO operator.


You can use these separators:


Example of query | Results
experience TO EOP {(driver or chauffeur) and >= 3} | Locates resumes of persons with a minimum of three years' experience as a driver.
EOP TO EOP {economic and policy} | Locates a single paragraph that includes both economic and policy.
"EOG" | EOG
EOG | Retrieves all files with an End Of Page marker, which is, of course, every file.

Search Rules and Conventions

i The Evolution of Enterprise Search: Specialist Applications Move to the Top of the "Food Chain", by Dr. Johannes C. Scholtes, President and CEO, ZyLAB North America LLC, KMWorld, May 2007


ii ZyLAB Standard User Manual, ZyIMAGE 5.0, © Copyright 1997-2007 ZyLAB Technologies