Hacker news

Ask HN: How are thinking efforts implemented?

6 points by simianwords about 3 hours ago | 3 comments

pyentropy 13 minutes ago |

Take a look at the harmony repo, which specifies the internal OpenAI format: the effort level is specified in the context after the <|start|> tag. https://github.com/openai/harmony

Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>).
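
A minimal sketch of that "separate counter" idea, not taken from any particular inference library. The marker strings `<think>`, `</think>`, and `<eos>`, the `generate` function, and the `toy_model` callback are all hypothetical names chosen for illustration. Tokens are plain strings to keep it self-contained.

```python
from typing import Callable, List

# Hypothetical marker tokens; real models use their own special tokens.
THINK_START = "<think>"
THINK_END = "</think>"
EOS = "<eos>"


def generate(next_token: Callable[[List[str]], str],
             max_new_tokens: int,
             max_reasoning_tokens: int) -> List[str]:
    """Decode with two limits: one on the whole completion and a
    separate one on tokens emitted inside the reasoning span."""
    out: List[str] = []
    reasoning_used = 0      # counts only tokens between the think markers
    in_reasoning = False
    while len(out) < max_new_tokens:
        tok = next_token(out)
        # Reasoning budget exhausted: override the model and close the span.
        if in_reasoning and tok != THINK_END and reasoning_used >= max_reasoning_tokens:
            tok = THINK_END
        out.append(tok)
        if tok == THINK_START:
            in_reasoning = True
        elif tok == THINK_END:
            in_reasoning = False
        elif tok == EOS:
            break
        elif in_reasoning:
            reasoning_used += 1
    return out


def toy_model(out: List[str]) -> str:
    """Stand-in for a model that would keep reasoning forever if allowed."""
    if not out:
        return THINK_START
    if THINK_END not in out:
        return "r"
    after = len(out) - out.index(THINK_END) - 1
    return ["a", "b", EOS][min(after, 2)]


print(generate(toy_model, max_new_tokens=50, max_reasoning_tokens=3))
```

The overall `max_new_tokens` cap still applies on its own, so a completion can also be cut off mid-reasoning if that limit is reached first.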

aabdi about 1 hour ago |

Different models do slight variants.

Usually it's done in post-training to enforce behavior based on the prompt, i.e. a system prompt with thinking:max, or low, or whatever.

Enforcement then goes via constrained decoding: checking for the think start and end tokens with max lengths, or other variations.
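
One way the constrained-decoding part could look, as a rough sketch rather than how any specific model or server does it. The effort names, the budget numbers in `EFFORT_BUDGETS`, and both function names are made up for illustration. Once the think span reaches the budget for the chosen effort level, every logit except the end-of-think token is masked out, so the sampler can only close the span.

```python
import math
from typing import Dict, List

# Hypothetical mapping from effort level to a max think-span length.
EFFORT_BUDGETS: Dict[str, int] = {"low": 2, "medium": 8, "max": 32}


def constrain_logits(logits: List[float], think_len: int,
                     effort: str, end_think_id: int) -> List[float]:
    """Return logits unchanged while under budget; otherwise mask
    everything except the end-of-think token."""
    if think_len < EFFORT_BUDGETS[effort]:
        return list(logits)
    return [x if i == end_think_id else -math.inf
            for i, x in enumerate(logits)]


def greedy_pick(logits: List[float]) -> int:
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [2.0, 0.5, -1.0]   # toy vocabulary of 3 tokens; id 2 ends the think span
print(greedy_pick(constrain_logits(logits, think_len=1, effort="low", end_think_id=2)))
print(greedy_pick(constrain_logits(logits, think_len=2, effort="low", end_think_id=2)))
```

Under budget the model's own preference wins. At the budget the end-of-think token is the only one left with a finite logit.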

__patchbit__ about 2 hours ago |

At a guess, it may be associated with the token length of the context window. Down-selecting is consistent with the warning message, which forces a cutoff at the context window. The technical term "cache" is a synonym. Increasing the headroom for more "thinking" should allow the implementation to access more resources without warning about the cache breaking.
