Tuesday, May 27, 2008

sphinx charset_table with unicode character folding

In a project at work I needed Sphinx to treat accented characters and their unaccented versions as the same, e.g. é is equivalent to e. I took this list I found on the Sphinx wiki and transformed it into a Sphinx-friendly charset_table.
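For anyone who can't get hold of the list, here is a rough sketch of how such a table could be generated. It is not the script used for this post: instead of the wiki list it relies on Python's unicodedata module to strip accents, the function names fold_char and build_charset_table are made up for illustration, and the default range (Latin-1 Supplement plus Latin Extended-A) is an assumption you may want to widen.

```python
import unicodedata


def fold_char(ch):
    """Return the unaccented, lowercased ASCII letter for ch,
    or None if ch does not reduce to a single a-z letter."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", ch)
    # Drop the combining marks, keeping only the base character(s).
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    base = base.lower()
    if len(base) == 1 and "a" <= base <= "z" and base != ch:
        return base
    return None


def build_charset_table(first=0x00C0, last=0x017F):
    """Build charset_table entries that fold accented letters in the
    code point range [first, last] onto plain a-z."""
    # Basic entries: digits, case-folded ASCII letters, underscore.
    entries = ["0..9", "A..Z->a..z", "_", "a..z"]
    for cp in range(first, last + 1):
        target = fold_char(chr(cp))
        if target is not None:
            # Non-ASCII characters are written in U+XXXX form.
            entries.append("U+%04X->%s" % (cp, target))
    return entries


if __name__ == "__main__":
    print("charset_table = " + ", ".join(build_charset_table()))
```

The printed line can be pasted into the index section of sphinx.conf. Letters with no Unicode decomposition (ø, ß, æ and so on) are skipped by this approach, so they would need hand-written mappings.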

Now when I do a search for the string "Héctor Lavoe" it matches the string "Hector Lavoe". Awesome!

Edit: fixed missing "é" in first sentence.

7 comments:

Jökull said...

Thank you. Just made my day.

datadevil said...

thanks, I tried the same but somehow I couldn't get the format right, dunno why, but copy pasting it from your example works fine..probably an afternoon thing but still helpful!

MrBrown said...

Works like a charm, made my day too. Thanks for the tips!

geek said...

Thanks for this. But what about searching with punctuation? Currently I am trying to search for: aaa (bbb)

.. and it doesn't match with anything. It should.

Do I need to add quotes, brackets etc to the charset_table or ignore_words?

Jake said...

How would this scale with very large systems? Do you think it would cause significant delays in fetching?

Anonymous said...

Thanks a lot... I was just using Sphinx on a multilanguage site [Chinese, Armenian, Japanese...] and was going nuts until I found this nice resource. Thanks a lot.

Anonymous said...

hi!
could you please repost your charset_table? (link in post is broken)
thanks!