Sphinx
Written by Sebastián Grignoli. Nov 2, 2020
Overview
Recap of what I've been investigating and our goals regarding fast, massive full text search in SKU:
- It has to be lightning fast
- It must support partial matches and be flexible in general, more so than SQL's MATCH ... AGAINST and LIKE '%...%'
- Results must come back ordered in a meaningful way (close matches ranking higher)
- The index must be updated in real time (no delay between a record insert or update and that record being findable in the full text index)
- When inserting/updating records in bulk, the index should be updated only once at the end (achieve some kind of 'debouncing')
- It has to be easy on the developers (enable/disable full text search via a config flag, so SKU won't break in developers' local environments if the search daemon is missing)
We will give "Sphinx Search" a try. http://sphinxsearch.com
I'm still integrating it, but so far everything indicates that all of these goals can be achieved with it.
It serves searches via a native API but also via the MySQL protocol, which means we can connect to it as if it were a MySQL database and query it. The syntax it accepts is not SQL but SphinxQL, but we can handle that easily via PDO inside Laravel, just writing the raw query and executing it.
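For instance, the stock mysql command line client can talk to the daemon directly (a sketch; 9306 is searchd's usual SphinxQL port, as configured in the install section below):
mysql -h 127.0.0.1 -P 9306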
The good news is that not only can we connect to that MySQL protocol directly from Laravel (and we will be doing that too), but MariaDB also has a built-in Sphinx client in the form of a "storage engine" called "SphinxSE". It lets us create (fake) tables; whenever we query one, MariaDB proxies the query to the Sphinx server running in the background and fetches the results as a MariaDB-compatible resultset that we can join with our regular tables.
Querying the Sphinx index via the SphinxSE table gives us the IDs of the first page of records matching the search, so a full text search on sku.product_attributes would look like:
SELECT * FROM
sphinx.sku_product_attributes sx_pa
LEFT JOIN sku.product_attributes pa ON pa.id = sx_pa.id
WHERE sx_pa.query="simple and easy to install";
-- We run this query against MariaDB service, not Sphinx. MariaDB invokes Sphinx in the background.
Why are we going to connect to Sphinx directly if we can query it via the MariaDB internal proxy? Because we want a Real Time index, and we need to insert and update records in the index too, not just query it.
The MariaDB SphinxSE client allows querying only.
A regular (not real time) Sphinx index is declared in the config file and updated via a command line utility (indexer) that reads all the source documents and creates/replaces the index.
A real time (RT) index does not have a "source" declared in the config file and cannot be built with the command line utility. It has to be populated via INSERT or REPLACE INTO statements that look like SQL but must be run against the Sphinx daemon directly (by Laravel, in our case). So we need to run additional queries from SKU every time we update a product or a property value, against a secondary "mysql-like" connection to the Sphinx daemon.
Every time the ProductAttribute model updates the record, it would compare the current vs. old string fields and, if any of those changed, send the REPLACE INTO statement to Sphinx.
Every time the ProductAttribute repository (assuming we ever have one) updates many product_attribute records in bulk, we should trigger a process that keeps track of the latest "updated_at" value that was sent to Sphinx and runs an update for every record with a newer updated_at value (and tracks the newest value to be used next time). We shouldn't update product_attribute records in bulk without using the repository class, and we must remember to call the index updater every time we do such bulk updates.
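A sketch of the catch-up query such a process could run (the column names are assumptions, and :last_indexed_at is the tracked high-water mark):

SELECT id, name, value, group_id, updated_at
FROM product_attributes
WHERE updated_at > :last_indexed_at
ORDER BY updated_at ASC;
-- for each row, send a REPLACE INTO to the RT index,
-- then store the newest updated_at seen for the next run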
Creating and maintaining a Sphinx index on your MySQL table.
Installing Sphinx alongside MariaDB
Download and install from binaries.
Create a Linux user sphinxsearch and a folder /var/sphinxsearch/ for the config, index files and logs.
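A sketch of those two steps as shell commands (the useradd flags are one reasonable choice, not a requirement):

useradd --system --shell /usr/sbin/nologin sphinxsearch
mkdir -p /var/sphinxsearch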
Copy the config file sphinx.conf to /var/sphinxsearch/. The index definitions that go in it are covered in the sections below; at a minimum it needs a searchd block for the daemon itself.
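A sketch of that block (the ports are the usual defaults: 9312 speaks the native SphinxAPI protocol that SphinxSE uses, 9306 speaks the MySQL protocol for SphinxQL):

searchd
{
    listen      = 9312
    listen      = 9306:mysql41
    log         = /var/sphinxsearch/searchd.log
    query_log   = /var/sphinxsearch/query.log
    pid_file    = /var/sphinxsearch/searchd.pid
    binlog_path = /var/sphinxsearch
}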
Then give the sphinxsearch user ownership of the folder:

chown -R sphinxsearch:sphinxsearch /var/sphinxsearch
Adding a MariaDB internal client for Sphinx
In MariaDB:
INSTALL SONAME 'ha_sphinx'; -- (only once, to install the engine)
CREATE DATABASE sphinx; -- (only once per project, could be named sphinx_sku)
USE sphinx;
CREATE TABLE sku_attributes
(
  id BIGINT UNSIGNED NOT NULL,   -- document ID, matches our table's primary key
  weight INTEGER NOT NULL,       -- relevance weight computed by Sphinx
  query VARCHAR(3072) NOT NULL,  -- the search string is passed in through this column
  group_id INTEGER,
  INDEX(query)                   -- SphinxSE requires the query column to be indexed
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_attributes";
Run this once for every index we add to Sphinx. This table is our MariaDB gateway for querying that index.
Adding another index:
USE sphinx;
CREATE TABLE sku_products
(
id BIGINT UNSIGNED NOT NULL,
weight INTEGER NOT NULL,
query VARCHAR(3072) NOT NULL,
group_id INTEGER,
INDEX(query)
) ENGINE=SPHINX CONNECTION="sphinx://127.0.0.1:9312/sku_products";
Creating an index with MySQL source.
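A plain (disk-based) index reads its documents from a source block in sphinx.conf. A sketch, assuming a sku database and hypothetical name/value columns on product_attributes:

source sku_attributes_src
{
    type          = mysql
    sql_host      = 127.0.0.1
    sql_user      = sphinx
    sql_pass      = secret
    sql_db        = sku
    # the first selected column must be the unique document ID
    sql_query     = SELECT id, group_id, name, value FROM product_attributes
    sql_attr_uint = group_id   # stored as an attribute, not full-text indexed
}

index sku_attributes
{
    source        = sku_attributes_src
    path          = /var/sphinxsearch/sku_attributes
    dict          = keywords
    min_infix_len = 3   # allows partial (infix) matches, one of our goals
}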
Populating the index.
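A plain index is built (and rebuilt) with the indexer utility that ships with Sphinx:

indexer --config /var/sphinxsearch/sphinx.conf sku_attributes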
Updating the index.
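Rebuilding while searchd is running needs --rotate, which builds new index files and signals the daemon to swap them in without downtime:

indexer --config /var/sphinxsearch/sphinx.conf --rotate sku_attributes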
Creating a Real Time index.
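An RT index declares its full text fields and attributes directly in sphinx.conf instead of pointing at a source (the field names here are assumptions matching the sketch above):

index sku_attributes_rt
{
    type         = rt
    path         = /var/sphinxsearch/sku_attributes_rt
    rt_field     = name
    rt_field     = value
    rt_attr_uint = group_id
}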
Updating the index.
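RT indexes are written to over the SphinxQL connection (port 9306), not through MariaDB. REPLACE INTO inserts the document, or overwrites it if that ID already exists, which is exactly what the model hook described above needs (the values are illustrative):

REPLACE INTO sku_attributes_rt (id, name, value, group_id)
VALUES (42, 'Mounting', 'simple and easy to install', 7);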
Querying the index.
Querying the index via direct connection to searchd daemon.
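Connect to searchd with any MySQL client on the SphinxQL port and use MATCH() for the full text condition (a sketch):

mysql -h 127.0.0.1 -P 9306

SELECT id, WEIGHT() FROM sku_attributes
WHERE MATCH('simple and easy to install')
ORDER BY WEIGHT() DESC
LIMIT 20;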
Querying the index alone from within MariaDB.
SELECT * FROM sphinx.sku_attributes WHERE query="simple and easy to install";
Joining the results with MariaDB records.
Querying the index and fetching the data from our corresponding MariaDB table in a single statement:
SELECT * FROM sphinx.sku_attributes sp_a
LEFT JOIN sku_io.product_attributes pa ON pa.id = sp_a.id
WHERE sp_a.query="simple and easy to install";