Begin Rescue: sphinx charset_table with unicode character folding

Tuesday, May 27, 2008

In a project at work I needed Sphinx to treat accented characters and their unaccented versions as the same, e.g. é is equivalent to e. I took the character folding list I found on the Sphinx wiki and transformed it into a Sphinx-friendly charset_table.
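The table itself is not reproduced in this text, so here is a rough sketch of how a similar one could be generated. It is not the method used above: instead of the wiki list, it derives the mappings from Unicode canonical decompositions using Python's standard `unicodedata` module. The function names and the U+00C0..U+017F range are assumptions chosen for illustration.

```python
import unicodedata


def fold_target(ch):
    """Return the lowercase ASCII base letter for an accented character, or None."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep only characters that reduce to one plain ASCII letter.
    if len(base) == 1 and base != ch and ord(base) < 128 and base.isalpha():
        return base.lower()
    return None


def charset_table_entries(start=0x00C0, end=0x017F):
    """Build 'U+XX->x' mappings for a code point range (range is an assumption)."""
    entries = []
    for cp in range(start, end + 1):
        target = fold_target(chr(cp))
        if target is not None:
            entries.append("U+%X->%s" % (cp, target))
    return entries


def build_charset_table():
    """Join the basic ASCII rules with the generated folding rules."""
    base_rules = ["0..9", "A..Z->a..z", "_", "a..z"]
    return "charset_table = " + ", ".join(base_rules + charset_table_entries())


if __name__ == "__main__":
    print(build_charset_table())
```

The printed line can be pasted into the index section of sphinx.conf. Characters with no canonical decomposition, such as ø, æ and ß, are skipped by this approach and would need hand-written entries.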

Now when I do a search for the string "Héctor Lavoe" it matches the string "Hector Lavoe". Awesome!
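The match works because Sphinx runs both the indexed text and the query through the same charset_table. The following Python is only a model of that idea, not Sphinx's code: the `fold` helper is hypothetical and approximates the table with Unicode decomposition.

```python
import unicodedata


def fold(text):
    """Approximate accent folding: strip combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


# Both sides are folded the same way, so they compare equal.
print(fold("Héctor Lavoe") == fold("Hector Lavoe"))  # prints True
```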

Edit: fixed missing "é" in first sentence.

7 comments:

Jökull Sólberg Auðunsson said...: Thank you. Just made my day.; July 2, 2008 at 7:25 AM
Unknown said...: Thanks. I tried the same but somehow couldn't get the format right, dunno why, but copy-pasting it from your example works fine. Probably an afternoon thing, but still helpful!; December 4, 2008 at 7:23 AM
MrBrown said...: Works like a charm, made my day too. Thanks for the tips!; March 24, 2009 at 11:11 PM
geek said...: Thanks for this. But what about searching with punctuation? Currently I am trying to search for: aaa (bbb)

.. and it doesn't match with anything. It should.

Do I need to add quotes, brackets etc to the charset_table or ignore_words?; October 13, 2009 at 5:13 AM
Unknown said...: How would this scale with very large systems? Do you think it would cause significant delays in fetching?; February 26, 2010 at 11:22 AM
Anonymous said...: Thanks a lot. I was just using Sphinx on a multi-language site [Chinese, Armenian, Japanese...] and was going nuts until I found this nice resource. Thanks a lot.; August 1, 2010 at 1:39 AM
Anonymous said...: hi!
could you please repost your charset_table? (link in post is broken)
thanks!; April 13, 2011 at 3:23 PM