Using a Search Engine to Find Data – Halloween Edition
By Elizabeth Thede, Special for USA Daily Times
Say you have a collection of files that you want to search for something like trick or treating. You could serially pull up each file, email, email attachment and the like in its relevant application—Microsoft Word, PowerPoint, Access, Excel, OneNote, Adobe Reader, Outlook, Exchange, also going through RAR, ZIP and other archives. And then you could browse one-by-one through each file to find all possible references to trick or treating.
But for larger volumes of data, this individual review approach would be quite cumbersome to say the least. Enter a search engine like dtSearch. dtSearch instantly searches terabytes in a completely different way. Rather than serially looking at each file in its relevant application, dtSearch heads directly to each file’s binary format.
The binary format world is like a parallel universe to the application view of files. In this parallel universe, files look completely different from how they appear inside of an application. In fact, you would be hard-pressed to read any text at all from many binary formats; they mostly just look like a sea of random-looking code. But parsing that random-looking code is the only way for a search engine to instantly search all at once across millions or even billions of files.
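To make the parallel universe a little more concrete, here is a minimal Python sketch. It relies on one well-known fact: a .docx file is a ZIP package of XML parts. Everything else is an assumption for illustration. The helper names make_sample_docx_bytes and extract_text are hypothetical, the sample container is far simpler than a real Word file, and none of this is meant to represent how dtSearch itself parses formats.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML text elements inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def make_sample_docx_bytes(text: str) -> bytes:
    """Build a tiny, simplified .docx-like ZIP container in memory (illustration only)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p><w:r>'
        f"<w:t>{text}</w:t></w:r></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


def extract_text(raw: bytes) -> str:
    """Open the container directly, without any word processor, and pull out the text runs."""
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return " ".join(node.text or "" for node in root.iter(f"{{{W_NS}}}t"))


raw = make_sample_docx_bytes("Trick or treat at six")
print(raw[:16])           # the raw bytes begin with the ZIP signature, not readable prose
print(extract_text(raw))  # the text recovered by parsing the container
```

The point of the sketch is only that the text lives inside a structured binary container, and a parser that understands the container can reach it without opening the authoring application.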
After parsing the binary formats, the search engine then builds an index storing each word along with the word’s location in the data. A single index can hold up to a terabyte of text. dtSearch can build any number of terabyte-size indexes, and search across all of them.
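The idea of an index that stores each word along with its location can be sketched in a few lines of Python. This is a toy positional inverted index under stated assumptions: the file names and sample sentences are made up, the tokenize and build_index helpers are hypothetical, and a production index such as dtSearch's is certainly far more elaborate.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the documents it appears in and its word positions there."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][doc_id].append(position)
    return index


# Hypothetical sample data for illustration.
documents = {
    "memo.txt": "Trick or treat starts at six. Bring candy bars.",
    "email.txt": "No gummy bears this year, only candy bars for trick or treaters.",
}

index = build_index(documents)
print(dict(index["candy"]))  # which files contain "candy", and at which word offsets
print(dict(index["trick"]))
```

Once such an index exists, a search no longer has to read the files themselves. It only has to look up each query word and compare the stored positions.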
After indexing, whether in a classic network environment or a web-based environment, dtSearch supports an unlimited number of different search threads. That way, all users can concurrently do their own instant search. For each user, dtSearch then pulls up a copy of the binary file inside a browser for display with highlighted hits.
Not only can indexed search instantly search through terabytes, but the index supports over 25 different search types. Examples for dtSearch include any of the following or almost any combination of the following:
- You could search for trick or treat as a phrase, finding any document, email and the like that contains trick or treat as an exact phrase.
- Or you could turn on stemming, which looks for different word variants to find either trick or treat or trick or treating or trick or treaters.
- Or you could enter a wildcard search that matches any characters inserted into a term, finding a variant like trick or treat86.
- Or you could enter a Boolean “or” search finding any file, etc. that contains the word trick or the word treat. Another way to describe this type of search request is an “any words” query.
- Or you could enter a Boolean “and” search finding any documents that contain both the word trick and the word treat anywhere in the file. Another way to describe this type of search request is an “all words” query.
- Or you could enter a more complex Boolean search, such as looking for trick or treat but only in a file that also contains candy bars and does not mention gummy bears.
- Or you could look for trick or treat within X number of words in either direction from candy bars or gummy bears.
- Or you could look for trick or treat within X number of words before candy bars or gummy bears.
- Or you could apply fuzzy searching to sift through minor typographical errors such as if you misspelled trick as triick in an email. Similarly, words can also be slightly off following an optical character recognition or OCR process.
- Or you could do a concept search to automatically find not only trick or treat but synonyms of trick or treat. And these could be ordinary English language synonyms, or you can enter your own custom synonym rings.
- Or you could do a natural language search, where you could type in trick treat candy bars gummy bears and let dtSearch automatically rank all retrieved files by vector-space relevancy ranking. In that way, if candy is in millions of files but trick is in just a couple, trick would get a higher relevancy ranking. And relevancy ranking would also use the same mathematical formula to look at the density of the search terms in the files.
- Or you could enter all of these terms and give positive or negative variable term weightings on one or more of these terms.
- Or you could find trick or treat in a document or email but only if the document or email also contained a valid credit card number, or a specific regular expression, or a specific hash value.
- Or you could do a search for trick or treat in certain file or email metadata and candy bars and gummy bears in other full-text data.
- Or if you are a developer you could offer a faceted search option, so that the end-user could go to the candy topic, drill down to non-chocolate treats, and then drill down further to gummy bears or licorice and then look for Halloween.
- Or if you are a developer, you could classify certain content, such that anyone can see the trick or treat files, but content that has gummy bears in certain metadata would be “eyes only” to specific end-users.
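Several of the search types in the list above can be approximated with a small positional index in Python. The sketch below is illustrative only. It does not use dtSearch's query syntax or API, the three sample documents are invented, the helper names (phrase, near, fuzzy, rank) are hypothetical, and the ranking is a simple rarity weighting standing in for a full vector-space formula.

```python
import difflib
import math
import re
from collections import defaultdict

# Hypothetical sample documents; "c" contains the misspelling "triick".
DOCS = {
    "a": "trick or treat tonight with candy bars",
    "b": "trick or treating with gummy bears and candy bars",
    "c": "a triick of the light, no candy here",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


# word -> {doc_id: [word positions]}
INDEX = defaultdict(dict)
for doc_id, text in DOCS.items():
    for pos, word in enumerate(tokenize(text)):
        INDEX[word].setdefault(doc_id, []).append(pos)


def docs_with(word):
    """Set of documents containing the exact word."""
    return set(INDEX.get(word, {}))


def phrase(words):
    """Documents where the words appear consecutively and in order."""
    hits = set()
    for d in set.intersection(*(docs_with(w) for w in words)):
        if any(all(s + i in INDEX[w][d] for i, w in enumerate(words))
               for s in INDEX[words[0]][d]):
            hits.add(d)
    return hits


def near(word_a, word_b, distance):
    """Documents where the two words fall within `distance` words, in either direction."""
    return {d for d in docs_with(word_a) & docs_with(word_b)
            if any(abs(pa - pb) <= distance
                   for pa in INDEX[word_a][d] for pb in INDEX[word_b][d])}


def fuzzy(word, cutoff=0.8):
    """Documents containing a word spelled similarly to `word`."""
    similar = difflib.get_close_matches(word, list(INDEX), n=10, cutoff=cutoff)
    return set().union(*(docs_with(w) for w in similar))


def rank(query_words):
    """Order documents so that rarer query words contribute more to the score."""
    scores = defaultdict(float)
    for w in query_words:
        postings = INDEX.get(w, {})
        if postings:
            weight = math.log(1 + len(DOCS) / len(postings))
            for d, positions in postings.items():
                scores[d] += weight * len(positions)
    return sorted(scores, key=scores.get, reverse=True)


any_words = docs_with("trick") | docs_with("treat")   # Boolean "or"
all_words = docs_with("trick") & docs_with("treat")   # Boolean "and"
complex_query = (phrase(["trick", "or", "treat"]) & phrase(["candy", "bars"])) \
    - phrase(["gummy", "bears"])                      # and / and not

print(sorted(any_words), sorted(all_words), sorted(complex_query))
print(sorted(near("treat", "candy", 3)))
print(sorted(fuzzy("trick")))
print(rank(["trick", "treat", "candy", "bars", "gummy", "bears"]))
```

Each function works only from the stored word positions, which is why these query types can be combined freely once the index exists. Stemming, wildcards, synonyms, and metadata filters would add further lookups on top of the same structure.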
dtSearch has enterprise and developer products that can run “on premises” or on cloud platforms to instantly search terabytes of “Office” files, PDFs, emails along with nested attachments, databases and online data. Because dtSearch can instantly search terabytes, many dtSearch customers are Fortune 100 companies and federal, state and international government agencies. But anyone with lots of data to search can go to dtSearch.com and download a fully functional 30-day evaluation version of dtSearch’s end-user product, to search for anything related to Halloween or anything else!
RELATED: Kevin Price of the Price of Business show discusses the topic with Thede in a recent interview.