Generating a list of full text word breakers for SQL Server

The post “Generating a list of full text word breakers for SQL Server” was originally published for Leaf Node, but as that site is being wound down I’m republishing it here.

Why do you need a list of full text word breakers? SQL Server Full Text works on breaking text down into fragments we’d normally call words, and then working on those fragments. In English we know what most of those are likely to be, ‘ ‘,’.’,’,’ (space, full stop, comma) and some others. But what about other characters? What happens with a ‘? And what about in other collations? This is why a list of full text word breakers is very useful to have

This isn’t going to be a nice handy list of full text word breakers, but it will show how to quickly get such a list.

The basic principle is going to rely on sys.dm_fts_parser. As described on the linked page this takes a string, and then breaks it down into ‘tokens’ based on the defined language, word breaker, thesaurus and stop list. As an example:

select * FROM sys.dm_fts_parser ('"the quick brown fox jumpsover"', 1033, 0, 0)

The typo is deliberate, for once!. And this returns:

So it’s returned a row for every token it can find. This example was nice and simple as it only contains words and spaces. The type was to show that without a word breaker SQL Server will just return the string without interpreting it. So that was for spaces, now just to confirm for other characters. First of all commas:

select * FROM sys.dm_fts_parser ('"quick,brown"', 1033, 0, 0)

So, that returns 2 rows as expected. How about hyphens?:

select * FROM sys.dm_fts_parser ('"quick-brown"', 1033, 0, 0)

Interesting, we get 3 rows this time. This makes sense when you think that hyphenation isn’t an exact science so some people will use them, and some won’t. So by combining both ‘sides’ of the hyphenated word SQL Server can hopefully match both uses.

So, that’s the basic theory. So this little piece of T-SQL is going to loop through all the 255 ASCII characters. For each one we’re going to use it to join 2 ‘words’, and then run that string through sys.dm_fts_parser. If the function returns more than 1 row we now that it’s found a word breaker, so we then output the character, and the character code as not all the characters are printable. You’ll also notice that code 34 throws an error, that’s because it’s ” which is a reserved character within full Text searches.

declare @i integer
declare @cnt integer
set @i=0
while @i&lt;255
begin
  set @cnt=0
  select @cnt=COUNT(1) FROM sys.dm_fts_parser ('"word1'+CHAR(@i)+'word2"', 1033, 0, 0)
  if @cnt&gt;1
  begin
    print 'this char - '+char(@i)+' - char('+convert(varchar(3),@i)+') is a word breaker'
  end
  set @i=@i+1
end

Which gives a nice long list:

this char -  - char(1) is a word breaker
this char -  - char(2) is a word breaker
this char -  - char(3) is a word breaker
this char -  - char(4) is a word breaker
this char -  - char(5) is a word breaker
this char -  - char(6) is a word breaker
this char -  - char(7) is a word breaker
this char -  - char(8) is a word breaker
this char - 	 - char(9) is a word breaker
this char - 
 - char(10) is a word breaker
this char -  - char(11) is a word breaker
this char -  - char(12) is a word breaker
this char - 
 - char(13) is a word breaker
this char -  - char(14) is a word breaker
this char -  - char(15) is a word breaker
this char -  - char(16) is a word breaker
this char -  - char(17) is a word breaker
this char -  - char(18) is a word breaker
this char -  - char(19) is a word breaker
this char -  - char(20) is a word breaker
this char -  - char(21) is a word breaker
this char -  - char(22) is a word breaker
this char -  - char(23) is a word breaker
this char -  - char(24) is a word breaker
this char -  - char(25) is a word breaker
this char -  - char(26) is a word breaker
this char -  - char(27) is a word breaker
this char -  - char(28) is a word breaker
this char -  - char(29) is a word breaker
this char -  - char(30) is a word breaker
this char -  - char(31) is a word breaker
this char -   - char(32) is a word breaker
this char - ! - char(33) is a word breaker
Msg 7630, Level 15, State 3, Line 7
Syntax error near 'word2' in the full-text search condition '"word1"word2"'.
this char - # - char(35) is a word breaker
this char - $ - char(36) is a word breaker
this char - % - char(37) is a word breaker
this char - &amp; - char(38) is a word breaker
this char - ( - char(40) is a word breaker
this char - ) - char(41) is a word breaker
this char - * - char(42) is a word breaker
this char - + - char(43) is a word breaker
this char - , - char(44) is a word breaker
this char - - - char(45) is a word breaker
this char - . - char(46) is a word breaker
this char - / - char(47) is a word breaker
this char - : - char(58) is a word breaker
this char - ; - char(59) is a word breaker
this char - &lt; - char(60) is a word breaker
this char - = - char(61) is a word breaker
this char - &gt; - char(62) is a word breaker
this char - ? - char(63) is a word breaker
this char - @ - char(64) is a word breaker
this char - [ - char(91) is a word breaker
this char - \ - char(92) is a word breaker
this char - ] - char(93) is a word breaker
this char - ^ - char(94) is a word breaker
this char - { - char(123) is a word breaker
this char - | - char(124) is a word breaker
this char - } - char(125) is a word breaker
this char - ~ - char(126) is a word breaker
this char -  - char(127) is a word breaker
this char -  - char(129) is a word breaker
this char - ‚ - char(130) is a word breaker
this char - „ - char(132) is a word breaker
this char - … - char(133) is a word breaker
this char - † - char(134) is a word breaker
this char - ‡ - char(135) is a word breaker
this char - ‰ - char(137) is a word breaker
this char - ‹ - char(139) is a word breaker
this char -  - char(141) is a word breaker
this char -  - char(143) is a word breaker
this char -  - char(144) is a word breaker
this char - “ - char(147) is a word breaker
this char - ” - char(148) is a word breaker
this char - • - char(149) is a word breaker
this char - – - char(150) is a word breaker
this char - — - char(151) is a word breaker
this char - › - char(155) is a word breaker
this char -  - char(157) is a word breaker
this char -   - char(160) is a word breaker
this char - ¡ - char(161) is a word breaker
this char - ¦ - char(166) is a word breaker
this char - § - char(167) is a word breaker
this char - « - char(171) is a word breaker
this char -  - char(173) is a word breaker
this char - ± - char(177) is a word breaker
this char - ¶ - char(182) is a word breaker
this char - · - char(183) is a word breaker
this char - » - char(187) is a word breaker
this char - ¿ - char(191) is a word breaker
this char - × - char(215) is a word breaker
this char - ÷ - char(247) is a word breaker

As you can see there’s a lot of word breakers in there. And you can generate an even larger set if you use unicode, though this will need to loop through a lot more that 255 records, and you’ll need to use nchar() instead of char().

1 Comment

Add Comment →

Ralf

Hi, thank you for the good explanation.

Can you tell me please how to remove the word breakers completly

I want to use the fulltext search e.g. for “61000-01” but i get results with “61000” and “01” and so on. I think i can solve the problem if i remove all word breakers (expect SPACE)

April 11, 2019