A complete walkthrough for creating your own search platform in minutes!
What is Apache Solr?
Apache Solr is an open-source search platform, built on Apache Lucene, that is known for being fast and reliable. Solr provides search capabilities through HTTP requests in near real time. Solr is a document-oriented search engine, meaning it uses request handlers to ingest data from documents of various types (XML, CSV, databases, MS Word, PDF) and index their fields. This allows for much richer search queries than plain keyword matching.
Elasticsearch is a more recently released open-source search engine that also extends Lucene’s powerful indexing and search functionality. Compared to Solr, it is often described as having better inherent scalability and a design better suited to cloud deployments. Check out this feature smackdown for a more detailed comparison of the two search engines!
Solr has a wide range of applications. Companies like Instagram, Netflix, DuckDuckGo, and AOL all use Apache Solr. I use it to perform complex searches on my personal documents, school notes, and other files!
Installing Solr
Let’s get to the fun part! First, you’ll need to make sure you have Java 8 (reported as version 1.8) or higher installed:
$ java -version
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
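If you’d rather script this check than read the output by eye, here is a minimal sketch. The helper names java_major and check_java are made up for this example, and it assumes the version appears in quotes on the first line of the output, as it does above.

```shell
# Extract the major Java version from a string such as "1.8.0_60" or "11.0.2".
java_major() {
  v=$1
  case $v in
    1.*) v=${v#1.} ;;    # legacy scheme: 1.8.0_60 means Java 8
  esac
  echo "${v%%[!0-9]*}"   # keep only the leading digits
}

# Read the installed version (java prints it to stderr) and compare it to 8.
check_java() {
  current=$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  major=$(java_major "$current")
  if [ -n "$major" ] && [ "$major" -ge 8 ]; then
    echo "Java $current is new enough for Solr"
  else
    echo "Java 8 or higher not found (got: ${current:-none})" >&2
    return 1
  fi
}
```

Call check_java after sourcing the snippet. It returns a non-zero status when Java is missing or too old.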
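Now, you can install Apache Solr. Go to the Solr website and download the archive appropriate for your operating system (.tgz for Linux and macOS, .zip for Windows). Then extract the Solr distribution, replacing x.y.z with the version you downloaded: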
$ cd ~/
$ tar zxf solr-x.y.z.tgz
Launch Solr
Now, you can launch Solr using the commands below. Once it has started, open the Solr Admin UI by typing http://localhost:8983/solr/ into your browser.
$ cd ~/solr-x.y.z
$ bin/solr start
*** [WARN] *** Your open file limit is currently 1024.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
*** [WARN] *** Your Max Processes Limit is currently 62866.
It should be set to 65000 to avoid operational disruption.
If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh
Waiting up to 180 seconds to see Solr running on port 8983 [|]
Started Solr server on port 8983 (pid=31325). Happy searching!
If you receive a Permission denied error, run the commands again with bash in front (for example, bash bin/solr start).
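To confirm from the terminal that the server is really up, you can poll it over HTTP. This is a rough sketch, not an official tool: wait_for_solr is a hypothetical helper, and it assumes curl is installed and that Solr is listening on the default port shown above.

```shell
# Poll Solr's system-info endpoint until it answers or the attempts run out.
# $1 = URL to poll (assumed default below), $2 = maximum number of attempts
wait_for_solr() {
  url=${1:-http://localhost:8983/solr/admin/info/system}
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, -s keeps it quiet
    if curl -fs -o /dev/null "$url"; then
      echo "Solr is answering at $url"
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  echo "Solr did not answer at $url" >&2
  return 1
}
```

Running wait_for_solr with no arguments tries once a second for up to 30 seconds. The built-in bin/solr status command reports similar information without any scripting.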
Add Documents
You’ve created a search server that will contain collections of documents, or sets of data. A document can be written in JSON, XML, and many other data-interchange formats. It is composed of fields: specific, pre-defined pieces of information, each with its own field type. A document’s fields are extracted into an index, which is consulted during a search.
In order to add documents, we must first create a core. A core is a running instance of a Lucene index that contains all the Solr configuration files required to use it. Run the line below to create a core named test:
$ bin/solr create -c test
WARNING: Using _default configset with data driven schema functionality. NOT RECOMMENDED for production use.
To turn off: bin/solr config -c test -p 8983 -action set-user-property -property update.autoCreateFields -value false
Created new core 'test'
Now, we can add documents to our Solr server. But first, a bit about documents. A document is the unit a user searches for: Solr finds the documents that match a query by comparing it against the field values stored for every document. For now, we can add the example documents that come with the Solr installation.
$ bin/post -c test example/exampledocs/*.xml
java -classpath /home/dpatel1/solr-8.5.2/solr/dist/solr-core-8.5.2-SNAPSHOT.jar -Dauto=yes -Dc=test -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/gb18030-example.xml example/exampledocs/hd.xml example/exampledocs/ipod_other.xml example/exampledocs/ipod_video.xml example/exampledocs/manufacturers.xml example/exampledocs/mem.xml example/exampledocs/money.xml example/exampledocs/monitor2.xml example/exampledocs/monitor.xml example/exampledocs/mp500.xml example/exampledocs/sd500.xml example/exampledocs/solr.xml example/exampledocs/utf8-example.xml example/exampledocs/vidcard.xml
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/test/update...
Entering auto mode. File endings considered are xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file gb18030-example.xml (application/xml) to [base]
POSTing file hd.xml (application/xml) to [base]
POSTing file ipod_other.xml (application/xml) to [base]
POSTing file ipod_video.xml (application/xml) to [base]
POSTing file manufacturers.xml (application/xml) to [base]
POSTing file mem.xml (application/xml) to [base]
POSTing file money.xml (application/xml) to [base]
POSTing file monitor2.xml (application/xml) to [base]
POSTing file monitor.xml (application/xml) to [base]
POSTing file mp500.xml (application/xml) to [base]
POSTing file sd500.xml (application/xml) to [base]
POSTing file solr.xml (application/xml) to [base]
POSTing file utf8-example.xml (application/xml) to [base]
POSTing file vidcard.xml (application/xml) to [base]
14 files indexed.
COMMITting Solr index changes to http://localhost:8983/solr/test/update...
Time spent: 0:00:00.717
If you go back to the Solr Admin UI at http://localhost:8983/solr/ and open the test core’s Overview page, you’ll see that it now contains documents!
To add your own documents, use the following format. Run it from the Solr directory, passing bin/post the name of your core and the path to the document you want to add:
$ bin/post -c <core_name> <file_name>.<file_type>
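Here is a small sketch of what that could look like with a file of your own. The file name my-notes.json and the two documents in it are made up for illustration. The field names id and name follow the example documents, and category is an arbitrary extra field.

```shell
# Write a small JSON file holding two hypothetical documents.
# Each document is a set of fields; "id" is the unique key.
cat > my-notes.json <<'EOF'
[
  {"id": "note-1", "name": "Green tea brewing notes", "category": "notes"},
  {"id": "note-2", "name": "Solr analytics cheat sheet", "category": "notes"}
]
EOF

# Index the file into the "test" core, but only when run from the Solr directory.
if [ -x bin/post ]; then
  bin/post -c test my-notes.json
fi
```

With the data-driven schema that the test core was created with, Solr should guess a type for any field it hasn’t seen before, which is handy for experimenting.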
Queries - Let's do some searching!
Queries are sent to a request handler, a plug-in that defines the logic used to process a search request. The request handler calls a query parser, which interprets the terms and parameters in the query. A query contains terms to search for, parameters for fine-tuning the query, and parameters for controlling how the response is presented. Solr’s default query parser is the Standard Query Parser (also known as the lucene parser).
There are many types of searches you can conduct. Here are a few:
Search type | Syntax | Description
Wildcard | ? and * | ? matches exactly one character and * matches zero or more characters
Fuzzy | ~ | Finds terms that are similar to the specified term, not only exact matches
Proximity | "..."~# | Finds terms that are within a given distance of each other (distance = number of term movements)
Range | [# TO #] | Finds documents whose values fall within a numerical or non-numerical range for the specified field(s). Use {# TO #} instead to exclude the endpoints.
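To make these search types concrete, here is a rough sketch that prints a browser-ready URL for each one. It assumes the local test core from earlier and the name and price fields from the example documents. urlencode and select_url are helper names made up for this example, and the encoder only handles plain ASCII.

```shell
# Percent-encode an ASCII string so it is safe to use as a URL parameter value.
urlencode() {
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}          # everything after the first character
    c=${s%"$rest"}       # the first character itself
    case $c in
      [A-Za-z0-9._~-]) out=$out$c ;;
      *) out=$out$(printf '%%%02X' "'$c") ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
}

# Print a select URL for the query in $1 against the assumed local "test" core.
select_url() {
  printf 'http://localhost:8983/solr/test/select?q=%s\n' "$(urlencode "$1")"
}

select_url 'name:gre?n'                 # wildcard: ? stands for exactly one character
select_url 'name:gre*'                  # wildcard: * stands for zero or more characters
select_url 'name:gren~'                 # fuzzy: terms similar to "gren"
select_url 'name:"solr analytics"~1'    # proximity: terms within one move of each other
select_url 'price:[0 TO 100]'           # range: 0 through 100, endpoints included
```

Paste any of the printed URLs into your browser to run that search. Swap in your own field names if your documents use different ones.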
We’ll be using the following parameters for basic searching. Many more parameters cover more advanced features, and you can find them on the specific query pages of the Solr Reference Guide. Here are a few of the most commonly used ones:
Parameter | Description
defType | Selects the query parser to be used (default: lucene)
sort | Orders results by score or by another field, ascending or descending
rows | Sets the maximum number of documents returned at a time (default: 10)
fq | Filter query: restricts the set of documents that can be returned, without affecting the score
fl | Field list: limits which fields are included for each document in the response
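As a sketch of how these parameters combine in one request, the helper below sends all five alongside the query. search_in_stock is a hypothetical name, the inStock and price fields are assumed to come from the example documents, and curl must be installed.

```shell
# Assumed location of the select handler for the "test" core created earlier.
SOLR_SELECT_URL="http://localhost:8983/solr/test/select"

# Run the query in $1 with a handful of common parameters.
# curl -G sends each --data-urlencode pair as an encoded URL parameter:
#   defType picks the parser, fq filters to in-stock items, sort orders by score,
#   rows caps the result count, and fl trims each document to three fields.
search_in_stock() {
  curl -sS -G "$SOLR_SELECT_URL" \
    --data-urlencode "q=$1" \
    --data-urlencode "defType=lucene" \
    --data-urlencode "fq=inStock:true" \
    --data-urlencode "sort=score desc" \
    --data-urlencode "rows=5" \
    --data-urlencode "fl=id,name,price"
}
```

For example, search_in_stock 'name:ipod' should return at most five in-stock documents, showing only their id, name, and price.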
Let’s say we are looking for any document with the word “green” in its name field. You would type a URL like http://localhost:8983/solr/test/select?q=name:green into your browser. Here, “test” is the core and “select” identifies the request handler for queries. The “?” after select separates the handler path from the query parameters (inside a query, “?” is instead the single-character wildcard). The “q” parameter holds the query itself, written in standard query syntax. And “name:” tells Solr to look for green in the name field.
This is a simple example. A level up is searching for a phrase. Let’s say we are looking for any mention of any type of solr analytics. You can do that with a proximity search, which wraps the phrase in quotes and appends ~ and a distance, for example "solr analytics"~1.
This will search every document for mentions of the words solr and analytics within one word of each other. For example, the phrases solr analytics and solr faceted analytics would both qualify. The reversed phrase analytics solr needs a distance of at least 2, because swapping two adjacent terms counts as two movements.
Hope you’ve had fun creating your very own search engine! There are endless ways to make it your own. I’d love your feedback on this tutorial and to hear about how you are using Solr search!
Happy Searching!