As written some time ago, I switched from implementing fulltext search from Ferret to MySql FULLTEXT!
…. and I was quite happy with that decision for some time, MySql FULLTEXT:
- was fast enough
- seemed to bring up the results I was expecting
- it works with every charset
- after adjusting ft_min_word_len it was indexing short enough terms
- rails specific: it was just a SQL call, so easy to integrate with e.g. pagination plugins
… but after some time, adding more data, and playing around, I discovered more and more some strange behavior:
Searches that were returning plausible results in the past were suddenly returning a empty result list!
Every straw I found on the Internet, like rebuilding the index etc. did not bring any more light to this situation.
So I was a good boy and really RTFM: MySQL Fulltext manual
The search result is empty because the word “MySQL” is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large data sets, this is the most desirable behavior: A natural language query should not return every second row from a 1GB table. For small data sets, it may be less desirable.
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular data set in which they occur. A given word may reach the 50% threshold in one data set but not another.
Aha, that’s it:
Since I was indexing e.g. countries and even tags via this index, It was most likely that there where more than 50% of datasets having this term and so they automatically were seen as stopwords.
Ok, nice standard behavior. But not for me!!!
So how to get rid of it:
The 50% threshold for natural language searches is determined by the particular weighting scheme chosen. To disable it, look for the following line in
storage/myisam/ftdefs.h:#define GWS_IN_USE GWS_PROBChange that line to this:
#define GWS_IN_USE GWS_FREQThen recompile MySQL. There is no need to rebuild the indexes in this case.
No, f*** off! I don’t want to recompile MySQL to get Fulltext working. My MySQL build from my Debian distribution is working quite fine.
Is there no other solution? Yes there is:
Users who need to bypass the 50% limitation can use the boolean search mode; see Section 11.8.2, “Boolean Full-Text Searches”.
So adding a short “IN BOOLEAN MODE” to the SQL query was the perfect solution! (… for me, hopefully using boolean mode works for you)
I’ve found mysql fulltextsearch to be very sufficiant in some cases too, no separate daemon that has to be kept running etc.
You might be interested in this:
http://github.com/ippa/acts_as_mysqlsearchable/tree/master