Sphinx
Written by Sebastián Grignoli. Nov 2, 2020
Contents
- 1 Overview
- 2 Creating and maintaining a Sphinx index on your MariaDB table.
Overview
Recap of what I've been investigating and our goals regarding fast massive full text search in SKU:
- It has to be lightning fast
- Be able to search by partial matches and certain flexibility in general, more flexible than SQL's MATCH AGAINST and LIKE %%
- Obtain the results ordered in a meaningful way (close matches ranking higher)
- Have the index updated real time (no delay between a record insert or update and being able to find it in the full text index)
- When inserting/updating records in bulk, update the index only once at the end (achieve some kind of 'debouncing')
- Easy on the developers (enable/disable full text search via a config flag so Sku won't break in developer's local environments if the search daemon is missing)
We will give "Sphinx Search" a try. http://sphinxsearch.com
I'm still integrating it but so far everything shows that all goals can be achieved with it.
It serves searches via a native API but also via MySQL protocol, which means that we can connect to it as if it were a MySQL database and query it, albeit the syntax it uses is not SQL but SphinxQL, but that can be easily handled by us via PDO inside Laravel, just writing the raw query and executing it.
Good news is that not only we can connect to that MySQL protocol via Laravel directly (and we will be doing that too), but also MariaDB has a built in Sphinx client in the form of a "storage engine" called "SphinxSE" that allows us to create (fake) tables and whenever we query them it will proxy the query to the Sphinx server running in the background and fetch the results in a MariaDB compatible resultset that we can join with our regular table.
Querying the Sphinx index via the SphinxSE table it will give us the ID's of the first page of records matching the search, so a full text search on sku.product_attribures would look like:
SELECT * FROM
sphinx.sku_product_attributes sx_pa
LEFT JOIN sku.product_attributes pa ON pa.id = sx_pa.id
WHERE sx_pa.query="simple and easy to install";
-- We run this query against MariaDB service, not Sphinx. MariaDB invokes Sphinx in the background.
Why are we going to connect to Sphinx directly if we can query it via the MariaDB internal proxy? Because we want a Real Time index, and we need to insert and update records in the index too, not just query it.
The MariaDB SphinxSE client allows querying only.
A regular (not real time) Sphinx index is declared in the config file and updated via a command line utility that reads all the source documents and creates/replaces the index.
A real time index (RT) does not have a "source" declared in the config file and cannot be built with the command line utility. It has to be populated via INSERT or REPLACE INTO statements that look like SQL statements but need to be run against the Sphinx daemon directly (by Laravel in our case). So we need to run additional queries in SKU every time we update a product or a property value, but against a secondary "mysql-like" connection to the Sphinx daemon.
Every time the ProductAttribute model updates the record, it would compare current vs old string fields and if any of those changed, send the REPLACE INTO statement to Sphinx.
Every time the ProductAttribute repository (assuming we ever have one) updates many product_attribute records in bulk we should trigger a process that would keep track of the latest "updated_at" value that was sent to Sphinx and run an update for every record that has a newer updated_at value (and track the newest value to be used next time). We shouldn't update product_attribute records in bulk without using the repository class and we must remember to call the index updater every time we do such bulk updates.
Creating and maintaining a Sphinx index on your MariaDB table.
Installing Sphinx
- Download and install from binaries
$ mkdir /usr/lib/sphinx-3.3.1
$ cp -R * /usr/lib/sphinx-3.3.1/
- Add it to the path.
- Create a Linux user: sphinxsearch
useradd sphinxsearch
- Create a folder for Sphinx's config and data
$ mkdir /var/sphinxsearch/
- Copy config file sphinx.conf to /var/sphinxsearch/ containing: sphinx.conf example
$ cp /var/www/dev.sku.io/current/app/Lib/SphinxSearch/sphinx.conf /var/sphinxsearch
$ chown sphinxsearch:sphinxsearch /var/sphinxsearch -R
- Transfer ownership of folder and files to the service user:
chown -R sphinxsearch:sphinxsearch /var/sphinxsearch
At this point Sphinx should be able to run and serve the configured databases:
$ sudo -u sphinxsearch searchd --config /var/sphinxsearch/sphinx.conf
Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/var/sphinxsearch/sphinx.conf'...
listening on all interfaces, port=9312
listening on all interfaces, port=9306
precaching index 'sku_products'
precaching index 'sku_product_attributes'
precached 2 indexes using 2 threads in 0.0 sec
Setting up a Systemd service
Copy app/Lib/SphinxSearch/sphinxsearch.service (from SKU.io repo) into /lib/systemd/system, or create it directly containing the following code:
[Unit]
Description=Sphinx Search Service (for SKU searches)
After=network.target
StartLimitIntervalSec=0
[Service]
Type=forking
User=sphinxsearch
Group=sphinxsearch
KillMode=process
KillSignal=SIGTERM
SendSIGKILL=no
LimitNOFILE=infinity
User=sphinxsearch
ExecStart=/usr/bin/env searchd --config /var/sphinxsearch/sphinx.conf
[Install]
WantedBy=multi-user.target
Alias=sphinxm@.service
Alias=searchdm@.service
Test it by running:
$ sudo systemctl start sphinxsearch
And check the outcome:
$ systemctl status sphinxsearch
● sphinxsearch.service - Sphinx Search Service (for SKU searches)
Loaded: loaded (/lib/systemd/system/sphinxsearch.service; linked)
Active: active (running) since Mon 2020-11-09 08:25:31 PST; 8min ago
Process: 26565 ExecStart=/usr/bin/env searchd --config /var/sphinxsearch/sphinx.conf (code=exited, status=0/SUCCESS)
Main PID: 26237 (searchd)
CGroup: /system.slice/sphinxsearch.service
└─26237 searchd --config /var/sphinxsearch/sphinx.conf
Nov 09 08:25:31 cricket systemd[1]: Started Sphinx Search Service (for SKU searches).
Nov 09 08:25:31 cricket env[26565]: Sphinx 3.3.1 (commit b72d67b)
Nov 09 08:25:31 cricket env[26565]: Copyright (c) 2001-2020, Andrew Aksyonoff
Nov 09 08:25:31 cricket env[26565]: Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
To enable the service (to be automatically started at boot):
$ systemctl enable sphinxsearch
Created symlink from /etc/systemd/system/sphinxm@.service to /lib/systemd/system/sphinxsearch.service.
Created symlink from /etc/systemd/system/searchdm@.service to /lib/systemd/system/sphinxsearch.service.
Created symlink from /etc/systemd/system/multi-user.target.wants/sphinxsearch.service to /lib/systemd/system/sphinxsearch.service.
Adding configuration options to Sku / Laravel
In .env file:
SPHINX_ENABLED=true
SPHINX_GATEWAY_DB=sphinx # For querying. MariaDB sees Sphinx as a database
SPHINX_CONNECTION=sphinx # For updating the index
SPHINX_HOST=127.0.0.1
SPHINX_PORT=9306
# Can enable/disable indexes individually:
SPHINX_INDEXES=sku_products,sku_product_attributes # (comma separated list)
Adding a MariaDB internal client for Sphinx
Via Sku artisan commands
$ php artisan sphinx:install-engine
SphinxSearch gateway engine SphinxSE was installed in MariaDB.
SphinxSearch gateway database "sphinx" was created in MariaDB.
$ php artisan sphinx:index-setup
Setting up index sku_products.
Setting up index sku_product_attributes.
Done.
Directly via MariaDB
This is what the artisan commands in the previous section are actually doing. This means that if Sphinx is correctly configured in Sku and the Artisan commands above ran without error we wont need to perform the steps in this section.
Explained in detail here but summarized below:
In MariaDB:
Only once, to enable the Sphinx Storage Engine -SphinxSE- that comes included in MariaDB, execute:
INSTALL SONAME 'ha_sphinx';
If successful, the command "SHOW ENGINES;" should include the SPHINX engine.
Then, do this once per project:
CREATE DATABASE sphinx; -- (could be named sphinx_sku or something project-related)
Then, every time we add an index to Sphinx we should do the following:
USE sphinx;
CREATE TABLE sku_attributes
(
id BIGINT UNSIGNED NOT NULL,
weight INTEGER NOT NULL,
query VARCHAR(3072) NOT NULL,
group_id INTEGER,
INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_attributes";
This table is our MariaDB gateway to query the Sphinx index.
Adding another index:
USE sphinx;
CREATE TABLE sku_products
(
id BIGINT UNSIGNED NOT NULL,
weight INTEGER NOT NULL,
query VARCHAR(3072) NOT NULL,
group_id INTEGER,
INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_products"
Creating an index with MySQL source.
Sphinx indexes can be Real Time (RT) or prebuilt. Both are queried the same, but need to be populated and maintained differently.
We will use RT indexes in our app, but let's start with the non-RT version which are easy to understand. I used non-RT indexes to assess the functionality that Sphinx offers before spending time integrating the RT version of them.
The Sphinx config file has 4 main sections:
- Sources section (one per index, only for non-RT). Explains how to obtain the data to index.
- Index section (one per index). Configures how to index them.
- Indexer settings.
- Searchd settings.
Prebuilt indexes need a source section which specifies how to obtain the data to be indexed. SQL is one of the available sources (as shown in the example below).
Real time (RT) indexes do not have a "source" section (will ignore it if there is one).
Example source section for a non-RT index
(Section of /var/sphinxsearch/sphinx.conf)
source sku_products
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = sebastian
sql_db = sku_io
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET NAMES utf8
sql_query = SELECT id, parent_id, brand_id, sku, name, barcode, mpn FROM products
# All non-text fields need to have their type explicitly specified:
sql_attr_uint = parent_id
sql_attr_uint = brand_id
}
Example index section for a non-RT index
(Section of /var/sphinxsearch/sphinx.conf)
index sku_products {
source = sku_products
path = /var/sphinxsearch/sku_products.sphinx.index
morphology = stem_en #, metaphone # You can add metaphone morphology if you want.
min_word_len = 1 # Indexes all words
blend_chars = - # This presumes people won't type a hyphen into the search bar: quite likely
blend_mode = trim_both #
html_strip = 1 # Just in case anyone tries to get clever in the admin panel and use HTML
# <substrings section>
# Explained here: [http://sphinxsearch.com/docs/current/conf-min-infix-len.html]
# Index substrings too, down to 3 chars but no less
min_infix_len = 3
# These three options are needed to differentiate full word matches from
# infix (substring) matches and rank them higher
dict = keywords
index_exact_words = 1
expand_keywords = 1
# </substrings section>
}
Populating the non-RT index
The defined indexes will be built with the followin bash command:
cd /var/sphinxsearch
# sphinx.conf must exist in this folder
searchd --stop #free resources by stopping the service
indexer --all #rebuild all indexes. Can also rebuild only one by name: indexer sku_products
searchd #restart the service
Creating a Real Time index.
Just omit the source part of the config file sphinx.conf for that index and include an index part.
Example index section for a Real Time index
(Section of /var/sphinxsearch/sphinx.conf)
index sku_attributes {
type = rt
path = /var/sphinxsearch/sku_attributes.sphinx.index
rt_mem_limit = 2047M
rt_attr_uint = id
rt_attr_uint = product_id
rt_attr_uint = attribute_id
rt_field = value
# <substrings section>
# Explained here: [http://sphinxsearch.com/docs/current/conf-min-infix-len.html]
# Index substrings too, down to 3 chars but no less
min_infix_len = 3
# These three options are needed to differentiate full word matches from
# infix (substring) matches and rank them higher
dict = keywords
index_exact_words = 1
expand_keywords = 1
# </substrings section>
}
Populating and updating the RT index.
RT indexes are maintained by our app, Sphinx do not auto sync the MariaBD table with it's indexes.
For that reason, populating an RT index for the first time or repopulating it completely must be done via our Artisan command:
$ php artisan sphinx:index-rebuild
Rebuilding index sku_products.
Rebuilding index sku_product_attributes.
Done.
Waiting for SphinxSearch async tasks to finish.
This must be done manually on initial setup. It can also be scheduled to periodically run in order to resync the index with the MariaDB tables.
Querying the index.
Querying the index via direct connection to searchd daemon.
We will not use this functionality.
We will only query via the MariaDB proxy client SphinxSE.
How to use this is explained in detail in the official documentation.
Querying the index alone from within MariaDB.
SELECT * FROM sphinx.sku_attributes WHERE query="simple and easy to install";
Joining the results with MariaDB records.
Querying the index but fetching data directly from our MariaDB equivalent table:
SELECT * FROM sphinx.sku_attributes sp_a LEFT JOIN sku_io.product_attributes pa ON pa.id = sp_a.i
d WHERE sp_a.query="simple and easy to install";
Passing additional options like field weights and a higher limit:
SELECT *
FROM products AS p
INNER JOIN sphinx.sku_products AS ft_p -- meaning "fulltext products"
ON ft_p.id = p.id
WHERE
ft_p.query = "simple and easy to install;offset=0;limit=1000"
Querying from SKU
To add a search to an Eloquent query builder we use the SphinxSearh::filter() method. We need to pass the query builder to add the filtering to, the name of the index we want to filter by, and the name of the field to match with the ID's that Sphinx returns (fk_field).
SphinxSearch::filter($builder, 'sku_products', ['fk_field'=>'products.id','query'=>$query]);
For the fulltext index "sku_products" the fk_field will usually be "products.id", because the ID of the sphinx resulting records will be a reference to that ID, but we could also join that particular fulltext index to another table that has that same value stored in a column, for instance: "product_attributes.product_id".
--- a/app/Http/Controllers/SupplierProductController.php
+++ b/app/Http/Controllers/SupplierProductController.php
@@ -11,6 +11,7 @@ use App\Response;
+use App\Lib\SphinxSearch\SphinxSearch;
class SupplierProductController extends Controller
{
@@ -47,7 +48,12 @@ class SupplierProductController extends Controller
if($query = $request->query('query')){
// And an 'or' filter for product name and product sku
$supplierProducts->whereHas('product', function(Builder $builder) use ($query) {
- return $builder->where('products.name', 'like', $query . '%')->orWhere('products.sku', 'like', $query . '%');
+ if (SphinxSearch::indexEnabled('sku_products')) {
+ SphinxSearch::filter($builder, 'sku_products', ['fk_field'=>'products.id','query'=>$query]);
+ } else {
+ $builder->where('products.name', 'like', $query . '%')->orWhere('products.sku', 'like', $query . '%');
+ }
+ return $builder;
});
}