I Can't Find My Keys


If you’re like me, you’ve misplaced your car keys many times, and it is always frustrating. In this post, though, I’m referring to the keys in your data files. In most cases, your data is organized by keys to facilitate lookups and matching. These keys may identify documents, customers, products or sales. The value of your data therefore depends largely on your ability to find and match the desired records quickly. Google Search and a product search on Amazon are prime examples.

Traditionally, matching and searching algorithms used blocking variables (grouped characteristics of the data) to narrow down the list of possible match candidates. With a relational database, search times of 100 to 250 milliseconds are considered fast. Many companies claim search and indexing capabilities among their core competencies. However, there are many expensive challenges associated with this traditional approach:

How to handle large queries:

Do you buy a bigger computer? Pay more in software license costs? What do you do when you exceed the capacity of a single machine? Clustering a relational database adds complexity.

How to economically handle load peaks:

Your hardware/software must be sized to handle the peaks (and then some), or you will have recurring service problems.

How to manage disaster recovery:

Do you maintain and synchronize a second production site – at twice the price?

Batch versus Online:

Is your solution the same for online queries and large batch jobs? If not, you will struggle to produce repeatable results across the two.
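To make the traditional blocking approach concrete, here is a minimal Python sketch. The blocking variable (ZIP code plus the first letter of the last name), the record fields and the similarity threshold are all assumptions chosen for illustration, not a description of any particular product.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    # Hypothetical blocking variable: ZIP code plus first letter of the last name.
    return (record["zip"], record["last_name"][:1].upper())


def build_blocks(records):
    # Group records by blocking key so a lookup only scans one small group.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def find_candidates(query, blocks, threshold=0.75):
    # Compare the query only against records that share its blocking key.
    matches = []
    for candidate in blocks.get(blocking_key(query), []):
        score = SequenceMatcher(
            None, query["last_name"].lower(), candidate["last_name"].lower()
        ).ratio()
        if score >= threshold:
            matches.append((score, candidate))
    # Best match first; sort on the score only.
    return sorted(matches, key=lambda pair: pair[0], reverse=True)
```

Blocking keeps the number of comparisons small, but a record whose blocking fields are wrong or missing (a mistyped ZIP code, for instance) is never considered at all.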

Is there a better solution? Apache Solr, billed as an enterprise search platform, provides a compelling alternative. It is an open-source solution that commoditizes search and indexing. The platform accepts and indexes documents in a variety of formats, including office documents, data files and PDF files. Solr also provides a robust means of searching the content, returning ranked matches very quickly, typically in under 50 milliseconds. We have built solutions around general document content searches as well as customer profile searches. We found Solr easy to set up, and the built-in search capabilities are quite sophisticated. Moreover, Solr in a Platform as a Service (PaaS) environment addresses many of the shortfalls of a relational database and provides functionality that exceeds a traditional solution.
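As a rough illustration of what a ranked search looks like from the client side, the sketch below builds a request URL for Solr's select handler and reads the ranked documents out of a JSON response. The host, the collection name "customers" and the helper function names are assumptions made up for this example.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, text, rows=10):
    # q = query text, rows = number of hits, fl = fields to return,
    # wt = response format. All four are standard Solr request parameters.
    params = {"q": text, "rows": rows, "fl": "id,score", "wt": "json"}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


def ranked_ids(response):
    # A Solr JSON response lists hits under response -> docs,
    # ordered by relevance score unless a sort is requested.
    return [(doc["id"], doc["score"]) for doc in response["response"]["docs"]]


# Hypothetical local instance and collection name.
url = build_select_url("http://localhost:8983", "customers", "virginia beach")
```

The URL can be fetched with any HTTP client; the parsed JSON body is what `ranked_ids` expects.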

  • Cloud Level Scaling – Solr scales horizontally rather than vertically, which simply means the data is split into more partitions (shards) spread across more (inexpensive) computers. You can even set this up to adjust automatically, leading to the most cost-effective solution.
  • Availability Zones – Multiple availability zones are a standard feature of most PaaS solutions, so you don’t have to make any extra effort to design a high availability environment that gives you full confidence.
  • Performance – Solr is extremely fast and scalable. The solution is built for interactive use and can also be used for large batch jobs delivering consistency across use cases. Using a Solr search solution on the front of a NoSQL database opens up another wide range of high-performance solutions.
  • Fuzzy Searches – Most PaaS search solutions offer functionality out of the box that may previously have been out of reach. For example, stemming and synonym expansion are standard, which lets you match automatically on similar terms: stemming reduces words to a common root (e.g. tracking = track), while synonym lists equate variants (e.g. VA Beach = Virginia Beach, Bob = Robert). Match expressions enable you to rank or boost matched records based on geographical proximity, veto rules or any custom requirement. Fuzzy matching, faceting, and suggestions are also included as standard functionality.
  • Expense – Operational costs (hardware and software) for a full solution start at under $200 per month. Compare that to your hardware, licensing and infrastructure costs.
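To give a feel for the fuzzy matching and faceting mentioned in the list above, here is a small sketch that builds the query string for such a request. It uses Solr's standard `term~N` fuzzy syntax and the `facet` and `facet.field` parameters. The field names ("city", "state") and the helper functions are hypothetical, and the sketch assumes the terms are plain words with no special query characters to escape.

```python
from urllib.parse import urlencode


def fuzzy_term(term, max_edits=1):
    # Solr's fuzzy operator accepts an edit distance of 0, 1 or 2.
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    return f"{term}~{max_edits}"


def build_fuzzy_facet_query(field, terms, facet_field, max_edits=1):
    # Every term must match the field, each within max_edits edits.
    clause = " AND ".join(
        f"{field}:{fuzzy_term(term, max_edits)}" for term in terms
    )
    return urlencode([
        ("q", clause),
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # count hits per value of this field
        ("rows", 10),
    ])


# The misspelled "virgina" can still match "virginia" within one edit.
query_string = build_fuzzy_facet_query("city", ["virgina", "beach"], "state")
```

Appending this string to a select URL would return the matching records along with a count of hits per state.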

If your current solution includes web services in front of a relational database, consider a search solution using a Solr-based engine, optionally combined with a NoSQL database. A prototype on a Platform as a Service can be built in weeks with no capital outlay, so the investment risk is very low. Relus is confident that we can demonstrate superior performance, substantially lower cost and improved reliability with this solution. For more information on setting up a search solution using Solr, please contact a Relus Big Data Expert today.

AWS | Jay Duff