6 points by simianwords about 3 hours ago | 3 comments
pyentropy 13 minutes ago |
aabdi about 1 hour ago |
Usually it’s done in post-training, to enforce behavior based on the prompt, i.e. a system prompt with thinking:max or low or whatever.
Enforcement then goes via constrained decoding: checking for the think start and end tokens and applying max lengths, or other variations.
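A rough sketch of the decoding-side part, to make it concrete. This is not any particular library's implementation: the `<think>`/`</think>`/`<eos>` marker strings, the `next_token` callback and the `toy_model` stand-in are all made up for illustration. The idea is just that once the thinking budget is used up, the sampler's choice is overridden with the end-of-think token.

```python
from typing import Callable, List

# Hypothetical marker strings; real models use their own special tokens.
THINK_START = "<think>"
THINK_END = "</think>"
EOS = "<eos>"


def generate_with_think_budget(
    next_token: Callable[[List[str]], str],
    max_think_tokens: int,
    max_total_tokens: int,
) -> List[str]:
    """Decode token by token, forcing THINK_END once the think budget is spent."""
    out: List[str] = []
    in_think = False
    think_count = 0
    while len(out) < max_total_tokens:
        tok = next_token(out)
        # Budget exhausted inside a think block: override the model's choice.
        if in_think and tok != THINK_END and think_count >= max_think_tokens:
            tok = THINK_END
        out.append(tok)
        if tok == THINK_START:
            in_think = True
            think_count = 0
        elif tok == THINK_END:
            in_think = False
        elif in_think:
            think_count += 1
        if tok == EOS:
            break
    return out


def toy_model(prefix: List[str]) -> str:
    """Stand-in for a real model: thinks forever unless cut off."""
    if not prefix:
        return THINK_START
    if prefix[-1] == THINK_END:
        return "answer"
    if prefix[-1] == "answer":
        return EOS
    return "hmm"


tokens = generate_with_think_budget(toy_model, max_think_tokens=3, max_total_tokens=20)
print(tokens)
```

In a real stack the override would more likely be a logit mask than a token swap, but the budget logic is the same.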
__patchbit__ about 2 hours ago |
shanewei about 3 hours ago |
Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>).
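For illustration, the separate-counter idea could look roughly like this. It is a toy, not the API of any real inference library: `TokenLimits`, `StreamCounter`, the marker strings and the stop-reason names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenLimits:
    max_reasoning_tokens: int   # hard cap on tokens inside the think block
    max_completion_tokens: int  # hard cap on all generated tokens


class StreamCounter:
    """Hypothetical streaming parser that tracks two independent counters."""

    def __init__(self, limits: TokenLimits, think_start: str = "<think>",
                 think_end: str = "</think>", eos: str = "<eos>") -> None:
        self.limits = limits
        self.think_start = think_start
        self.think_end = think_end
        self.eos = eos
        self.in_reasoning = False
        self.reasoning_tokens = 0
        self.completion_tokens = 0

    def feed(self, token: str) -> Optional[str]:
        """Consume one token; return a stop reason, or None to keep going."""
        if token == self.eos:
            return "eos"
        self.completion_tokens += 1
        if token == self.think_start:
            self.in_reasoning = True
        elif token == self.think_end:
            self.in_reasoning = False
        elif self.in_reasoning:
            self.reasoning_tokens += 1
            if self.reasoning_tokens >= self.limits.max_reasoning_tokens:
                return "reasoning_limit"
        if self.completion_tokens >= self.limits.max_completion_tokens:
            return "completion_limit"
        return None


counter = StreamCounter(TokenLimits(max_reasoning_tokens=2, max_completion_tokens=100))
reasons = [counter.feed(t) for t in ["<think>", "a", "b"]]
print(reasons)
```

What the caller does on "reasoning_limit" (stop outright, or inject the end-of-think token and let the answer continue) is a separate policy decision.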