The post “Generating a list of full text word breakers for SQL Server” was originally published for Leaf Node, but as that site is being wound down I’m republishing it here.
Why do you need a list of full text word breakers? SQL Server Full Text works on breaking text down into fragments we’d normally call words, and then working on those fragments. In English we know what most of those are likely to be, ‘ ‘,’.’,’,’ (space, full stop, comma) and some others. But what about other characters? What happens with a ‘? And what about in other collations? This is why a list of full text word breakers is very useful to have
This isn’t going to be a nice handy list of full text word breakers, but it will show how to quickly get such a list.
The basic principle is going to rely on sys.dm_fts_parser
. As described on the linked page this takes a string, and then breaks it down into ‘tokens’ based on the defined language, word breaker, thesaurus and stop list. As an example:
select * FROM sys.dm_fts_parser ('"the quick brown fox jumpsover"', 1033, 0, 0)
The typo is deliberate, for once!. And this returns:
So it’s returned a row for every token it can find. This example was nice and simple as it only contains words and spaces. The type was to show that without a word breaker SQL Server will just return the string without interpreting it. So that was for spaces, now just to confirm for other characters. First of all commas:
select * FROM sys.dm_fts_parser ('"quick,brown"', 1033, 0, 0)
So, that returns 2 rows as expected. How about hyphens?:
select * FROM sys.dm_fts_parser ('"quick-brown"', 1033, 0, 0)
Interesting, we get 3 rows this time. This makes sense when you think that hyphenation isn’t an exact science so some people will use them, and some won’t. So by combining both ‘sides’ of the hyphenated word SQL Server can hopefully match both uses.
So, that’s the basic theory. So this little piece of T-SQL is going to loop through all the 255 ASCII characters. For each one we’re going to use it to join 2 ‘words’, and then run that string through sys.dm_fts_parser. If the function returns more than 1 row we now that it’s found a word breaker, so we then output the character, and the character code as not all the characters are printable. You’ll also notice that code 34 throws an error, that’s because it’s ” which is a reserved character within full Text searches.
declare @i integer declare @cnt integer set @i=0 while @i<255 begin set @cnt=0 select @cnt=COUNT(1) FROM sys.dm_fts_parser ('"word1'+CHAR(@i)+'word2"', 1033, 0, 0) if @cnt>1 begin print 'this char - '+char(@i)+' - char('+convert(varchar(3),@i)+') is a word breaker' end set @i=@i+1 end
Which gives a nice long list:
this char - - char(1) is a word breaker this char - - char(2) is a word breaker this char - - char(3) is a word breaker this char - - char(4) is a word breaker this char - - char(5) is a word breaker this char - - char(6) is a word breaker this char - - char(7) is a word breaker this char - - char(8) is a word breaker this char - - char(9) is a word breaker this char - - char(10) is a word breaker this char - - char(11) is a word breaker this char - - char(12) is a word breaker this char - - char(13) is a word breaker this char - - char(14) is a word breaker this char - - char(15) is a word breaker this char - - char(16) is a word breaker this char - - char(17) is a word breaker this char - - char(18) is a word breaker this char - - char(19) is a word breaker this char - - char(20) is a word breaker this char - - char(21) is a word breaker this char - - char(22) is a word breaker this char - - char(23) is a word breaker this char - - char(24) is a word breaker this char - - char(25) is a word breaker this char - - char(26) is a word breaker this char - - char(27) is a word breaker this char - - char(28) is a word breaker this char - - char(29) is a word breaker this char - - char(30) is a word breaker this char - - char(31) is a word breaker this char - - char(32) is a word breaker this char - ! - char(33) is a word breaker Msg 7630, Level 15, State 3, Line 7 Syntax error near 'word2' in the full-text search condition '"word1"word2"'. this char - # - char(35) is a word breaker this char - $ - char(36) is a word breaker this char - % - char(37) is a word breaker this char - & - char(38) is a word breaker this char - ( - char(40) is a word breaker this char - ) - char(41) is a word breaker this char - * - char(42) is a word breaker this char - + - char(43) is a word breaker this char - , - char(44) is a word breaker this char - - - char(45) is a word breaker this char - . - char(46) is a word breaker this char - / - char(47) is a word breaker this char - : - char(58) is a word breaker this char - ; - char(59) is a word breaker this char - < - char(60) is a word breaker this char - = - char(61) is a word breaker this char - > - char(62) is a word breaker this char - ? - char(63) is a word breaker this char - @ - char(64) is a word breaker this char - [ - char(91) is a word breaker this char - \ - char(92) is a word breaker this char - ] - char(93) is a word breaker this char - ^ - char(94) is a word breaker this char - { - char(123) is a word breaker this char - | - char(124) is a word breaker this char - } - char(125) is a word breaker this char - ~ - char(126) is a word breaker this char - - char(127) is a word breaker this char - - char(129) is a word breaker this char - ‚ - char(130) is a word breaker this char - „ - char(132) is a word breaker this char - … - char(133) is a word breaker this char - † - char(134) is a word breaker this char - ‡ - char(135) is a word breaker this char - ‰ - char(137) is a word breaker this char - ‹ - char(139) is a word breaker this char - - char(141) is a word breaker this char - - char(143) is a word breaker this char - - char(144) is a word breaker this char - “ - char(147) is a word breaker this char - ” - char(148) is a word breaker this char - • - char(149) is a word breaker this char - – - char(150) is a word breaker this char - — - char(151) is a word breaker this char - › - char(155) is a word breaker this char - - char(157) is a word breaker this char - - char(160) is a word breaker this char - ¡ - char(161) is a word breaker this char - ¦ - char(166) is a word breaker this char - § - char(167) is a word breaker this char - « - char(171) is a word breaker this char - - char(173) is a word breaker this char - ± - char(177) is a word breaker this char - ¶ - char(182) is a word breaker this char - · - char(183) is a word breaker this char - » - char(187) is a word breaker this char - ¿ - char(191) is a word breaker this char - × - char(215) is a word breaker this char - ÷ - char(247) is a word breaker
As you can see there’s a lot of word breakers in there. And you can generate an even larger set if you use unicode, though this will need to loop through a lot more that 255 records, and you’ll need to use nchar()
instead of char()
.