Tuesday, May 27, 2008

sphinx charset_table with unicode character folding

For a project at work I needed Sphinx to treat accented characters and their unaccented versions as the same, e.g. é is equivalent to e. I took this list I found on the Sphinx wiki and transformed it into a Sphinx-friendly charset_table.

Now when I do a search for the string "Héctor Lavoe" it matches the string "Hector Lavoe". Awesome!
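
For anyone without the wiki list at hand, here is a rough sketch of how similar folding entries could be generated. This is not how the table above was made (that came from the wiki list); it assumes Python 3.7+ and that Unicode NFD decomposition is a good enough stand-in for the list. The function name folding_entries and the U+00C0..U+017F range are my own choices for illustration.

```python
import unicodedata


def folding_entries(first=0x00C0, last=0x017F):
    """Return charset_table mappings such as 'U+00E9->e' for accented letters.

    Hypothetical helper: the code point range is an arbitrary illustration.
    """
    entries = []
    for codepoint in range(first, last + 1):
        char = chr(codepoint)
        # NFD splits a precomposed letter into a base letter plus combining marks.
        decomposed = unicodedata.normalize("NFD", char)
        base, marks = decomposed[0], decomposed[1:]
        # Keep only characters that are an ASCII letter followed by combining marks.
        if (marks and base.isascii() and base.isalpha()
                and all(unicodedata.combining(m) for m in marks)):
            entries.append(f"U+{codepoint:04X}->{base.lower()}")
    return entries


entries = folding_entries()
# Prepend the usual ASCII part of the table, then the generated foldings.
print("charset_table = 0..9, A..Z->a..z, _, a..z, " + ", ".join(entries))
```

Letters with no canonical decomposition (ø, for example) are skipped by this approach, so a hand-made list will likely still cover more cases.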

Edit: fixed missing "é" in first sentence.

Splitting long lines in text files with vim

Recently I needed to split long lines in a text file, and it took me a few minutes to figure out how to do this neatly in vim. In case I need it again, here's the substitute command I used:

%s/\(.\{250\}[^ ]* \)/\1 \\^M/gc

250 is an arbitrary number I picked, and the "^M" part is a newline entered in vim by pressing ctrl-v, ctrl-m. The pattern matches 250 characters plus the rest of the current word, so each break lands at the first space after that point. I split at a space because I knew the data had frequent spaces and I didn't want to split a word.
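
Outside of vim, the same idea can be sketched in a few lines of Python. This is only an illustration of the pattern above, not a drop-in replacement: the function name split_long_lines is made up, and it inserts a bare newline after each chunk, whereas the vim replacement as written also appears to put a space and a backslash before the break.

```python
import re


def split_long_lines(text, width=250):
    """Break lines at the first space found after `width` characters.

    Hypothetical helper mirroring the vim pattern: `width` characters,
    then the rest of the current word, then one space.
    """
    # "." does not match a newline, so each match stays within one line.
    pattern = re.compile(r"(.{%d}[^ ]* )" % width)
    # Keep the matched chunk (including its trailing space) and add a newline.
    return pattern.sub(lambda m: m.group(1) + "\n", text)


if __name__ == "__main__":
    sample = "word " * 100
    print(split_long_lines(sample, width=20))
```

Lines with no space after the first `width` characters are left alone, same as with the vim command.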