Sphinx

From Doku

Written by Sebastián Grignoli. Nov 2, 2020

Overview

Recap of what I've been investigating and our goals regarding fast massive full text search in SKU:

  • It has to be lightning fast
  • Be able to search by partial matches and certain flexibility in general, more flexible than SQL's MATCH AGAINST and LIKE %%
  • Obtain the results ordered in a meaningful way (close matches ranking higher)
  • Have the index updated real time (no delay between a record insert or update and being able to find it in the full text index)
  • When inserting/updating records in bulk, update the index only once at the end (achieve some kind of 'debouncing')
  • Easy on the developers (enable/disable full text search via a config flag so Sku won't break in developer's local environments if the search daemon is missing)

We will give "Sphinx Search" a try. http://sphinxsearch.com

I'm still integrating it but so far everything shows that all goals can be achieved with it.

It serves searches via a native API but also via MySQL protocol, which means that we can connect to it as if it were a MySQL database and query it, albeit the syntax it uses is not SQL but SphinxQL, but that can be easily handled by us via PDO inside Laravel, just writing the raw query and executing it.

Good news is that not only we can connect to that MySQL protocol via Laravel directly (and we will be doing that too), but also MariaDB has a built in Sphinx client in the form of a "storage engine" called "SphinxSE" that allows us to create (fake) tables and whenever we query them it will proxy the query to the Sphinx server running in the background and fetch the results in a MariaDB compatible resultset that we can join with our regular table.

Querying the Sphinx index via the SphinxSE table it will give us the ID's of the first page of records matching the search, so a full text search on sku.product_attribures would look like:

SELECT * FROM 
sphinx.sku_product_attributes sx_pa 
LEFT JOIN sku.product_attributes pa ON pa.id = sx_pa.id 
WHERE sx_pa.query="simple and easy to install"; 
 -- We run this query against MariaDB service, not Sphinx.  MariaDB invokes Sphinx in the background.

Why are we going to connect to Sphinx directly if we can query it via the MariaDB internal proxy? Because we want a Real Time index, and we need to insert and update records in the index too, not just query it.

The MariaDB SphinxSE client allows querying only.

A regular (not real time) Sphinx index is declared in the config file and updated via a command line utility that reads all the source documents and creates/replaces the index.

A real time index (RT) does not have a "source" declared in the config file and cannot be built with the command line utility. It has to be populated via INSERT or REPLACE INTO statements that look like SQL statements but need to be run against the Sphinx daemon directly (by Laravel in our case). So we need to run additional queries in SKU every time we update a product or a property value, but against a secondary "mysql-like" connection to the Sphinx daemon.

Every time the ProductAttribute model updates the record, it would compare current vs old string fields and if any of those changed, send the REPLACE INTO statement to Sphinx.

Every time the ProductAttribute repository (assuming we ever have one) updates many product_attribute records in bulk we should trigger a process that would keep track of the latest "updated_at" value that was sent to Sphinx and run an update for every record that has a newer updated_at value (and track the newest value to be used next time). We shouldn't update product_attribute records in bulk without using the repository class and we must remember to call the index updater every time we do such bulk updates.

Creating and maintaining a Sphinx index on your MariaDB table.

Installing Sphinx standalone service

  • Download and install from binaries.
  • Add it to the path.
  • Create a Linux user: sphinxsearch
  • Create a folder: /var/sphinxsearch/
  • Copy config file sphinx.conf to /var/sphinxsearch/ containing: sphinx.conf example
  • Transfer ownership of folder and files to the service user:
chown -R sphinxsearch:sphinxsearch /var/sphinxsearch

Adding a MariaDB internal client for Sphinx

Explained in detail here but summarized below:

In MariaDB:

Only once, to enable the Sphinx Storage Engine -SphinxSE- that comes included in MariaDB, execute:

INSTALL SONAME 'ha_sphinx';

If successful, the command

"SHOW ENGINES;"

should include the SPHINX engine.

Then, do this once per project:

CREATE DATABASE sphinx; -- (could be named sphinx_sku or something project-related)

Then, every time we add an index to Sphinx we should do the following:

USE sphinx;
CREATE TABLE sku_attributes
(
    id BIGINT UNSIGNED NOT NULL,
    weight  INTEGER NOT NULL,
    query   VARCHAR(3072) NOT NULL,
    group_id  INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_attributes";

This table is our MariaDB gateway to query the Sphinx index.

Adding another index:

USE sphinx;
CREATE TABLE sku_products
(
    id BIGINT UNSIGNED NOT NULL,
    weight  INTEGER NOT NULL,
    query   VARCHAR(3072) NOT NULL,
    group_id  INTEGER,
    INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_products"

Creating an index with MySQL source.

Sphinx indexes can be Real Time (RT) or prebuilt. Both are queried the same, but need to be populated and maintained differently.

We will use RT indexes in our app, but let's start with the non-RT version which are easy to understand. I used non-RT indexes to assess the functionality that Sphinx offers before spending time integrating the RT version of them.

The Sphinx config file has 4 main sections:

  • Sources section (one per index, only for non-RT). Explains how to obtain the data to index.
  • Index section (one per index). Configures how to index them.
  • Indexer settings.
  • Searchd settings.

Prebuilt indexes need a source section which specifies how to obtain the data to be indexed. SQL is one of the available sources (as shown in the example below).

Real time (RT) indexes do not have a "source" section (will ignore it if there is one).

Example source section for a non-RT index

(Section of /var/sphinxsearch/sphinx.conf)

source sku_products
{
	type			= mysql

	sql_host		= localhost
	sql_user		= root
	sql_pass		= sebastian
	sql_db			= sku_io
	sql_port		= 3306	# optional, default is 3306

	sql_query_pre		= SET NAMES utf8

	sql_query      = SELECT id, parent_id, brand_id, sku, name, barcode, mpn FROM products

        # All non-text fields need to have their type explicitly specified:
        sql_attr_uint = parent_id
        sql_attr_uint = brand_id
}

Example index section for a non-RT index

(Section of /var/sphinxsearch/sphinx.conf)

index sku_products {
    source = sku_products
    path   = /var/sphinxsearch/sku_products.sphinx.index

    morphology   = stem_en     #, metaphone # You can add metaphone morphology if you want.
    min_word_len = 1           # Indexes all words
    blend_chars  = -           # This presumes people won't type a hyphen into the search bar: quite likely
    blend_mode   = trim_both   #
    html_strip   = 1           # Just in case anyone tries to get clever in the admin panel and use HTML


    # <substrings section>
        # Explained here: [http://sphinxsearch.com/docs/current/conf-min-infix-len.html]

        # Index substrings too, down to 3 chars but no less
        min_infix_len = 3
        # These three options are needed to differentiate full word matches from
        # infix (substring) matches and rank them higher
        dict = keywords
        index_exact_words = 1
        expand_keywords = 1
    # </substrings section>
}

Populating the non-RT index

The defined indexes will be built with the followin bash command:

cd /var/sphinxsearch
# sphinx.conf must exist in this folder
searchd --stop  #free resources by stopping the service
indexer --all   #rebuild all indexes.  Can also rebuild only one by name:  indexer sku_products
searchd         #restart the service

Creating a Real Time index.

Just omit the source part of the config file sphinx.conf for that index and include an index part.

Example index section for a Real Time index

(Section of /var/sphinxsearch/sphinx.conf)

index sku_attributes {
    type = rt
    path   = /var/sphinxsearch/sku_attributes.sphinx.index

    rt_mem_limit    = 2047M

    rt_attr_uint = id
    rt_attr_uint = product_id
    rt_attr_uint = attribute_id
    rt_field = value

   # <substrings section>
       # Explained here: [http://sphinxsearch.com/docs/current/conf-min-infix-len.html]

       # Index substrings too, down to 3 chars but no less
       min_infix_len = 3
       # These three options are needed to differentiate full word matches from
       # infix (substring) matches and rank them higher
       dict = keywords
       index_exact_words = 1
       expand_keywords = 1
   # </substrings section>

}

Populating and updating the RT index.

Querying the index.

Querying the index via direct connection to searchd daemon.

We will not use this functionality.

We will only query via the MariaDB proxy client SphinxSE.

How to use this is explained in detail in the official documentation.

Querying the index alone from within MariaDB.

SELECT * FROM sphinx.sku_attributes WHERE query="simple and easy to install";

Joining the results with MariaDB records.

Querying the index but fetching data directly from our MariaDB equivalent table:

SELECT * FROM sphinx.sku_attributes sp_a LEFT JOIN sku_io.product_attributes pa ON pa.id = sp_a.i
d WHERE sp_a.query="simple and easy to install";

Passing additional options like field weights and a higher limit:

SELECT *
 FROM products AS p 
 INNER JOIN sphinx.sku_products AS ft_p  -- meaning "fulltext products"
   ON ft_p.id = p.id 
 WHERE
   ft_p.query = "simple and easy to install;sort=extended:@weight desc;offset=0;limit=1000;weights=60,10,10,10"