Need to let loose a primal scream without collecting footnotes first? Have a sneer percolating in your system but not enough time/energy to make a whole post about it? Go forth and be mid: Welcome to the Stubsack, your first port of call for learning fresh Awful you’ll near-instantly regret.

Any awful.systems sub may be subsneered in this subthread, techtakes or no.

If your sneer seems higher quality than you thought, feel free to cut’n’paste it into its own post; there’s no quota for posting and the bar really isn’t that high.

The post-Xitter web has spawned so many “esoteric” right-wing freaks, but there’s no appropriate sneer-space for them. I’m talking redscare-ish, reality-challenged “culture critics” who write about everything but understand nothing. I’m talking about reply-guys who make the same 6 tweets about the same 3 subjects. They’re inescapable at this point, yet I don’t see them mocked (as much as they should be).

Like, there was one dude a while back who insisted that women couldn’t be surgeons because they didn’t believe in the moon or in stars? I think each and every one of these guys is uniquely fucked up and if I can’t escape them, I would love to sneer at them.

Last week’s thread

(Semi-obligatory thanks to @dgerard for starting this)

  • gerikson@awful.systems
    2 days ago

    Dude discovers that one LLM model is not entirely shit at chess, spends time and tokens proving that other models are actually also not shit at chess.

    The irony? He’s comparing it against Stockfish, a computer chess engine. Computers playing chess at a superhuman level is a solved problem. LLMs have now slightly approached that level.

    For one, gpt-3.5-turbo-instruct rarely suggests illegal moves,

    Writeup https://dynomight.net/more-chess/

    HN discussion https://news.ycombinator.com/item?id=42206817

    • YourNetworkIsHaunted@awful.systems
      2 days ago

      Particularly hilarious is how thoroughly they’re missing the point. The fact that it suggests illegal moves at all means that no matter how good its openings are, the scaling laws and emergent behaviors haven’t magicked up an internal model of the game of chess or even the state of the chess board it’s working with. I feel like playing games is a particularly powerful example of this because the game rules provide a very clear structure to model and it’s very obvious when that model doesn’t exist.

    • BigMuffin69@awful.systems
      2 days ago

      I remember several months ago (a year ago?) when the news got out that gpt-3.5-turbo-papillion-grumpalumpgus could play chess at around 1600 Elo. I suspected the apparent skill was just a hacked-on patch to stop folks from clowning on their models on xitter. Like if an LLM had just read the instructions of chess and started playing like a competent player, that would be genuinely impressive. But if what happened is they generated 10^12 synthetic games of chess played by stonk fish and used that to train the model, that ain’t an emergent ability, that’s just brute forcing chess. The fact that larger, open-source models that perform better on other benchmarks still flail at chess is just a glaring red flag that something funky was going on w/ gpt-3.5-turbo-instruct to drive home the “eMeRgEnCe” narrative. I’d bet decent odds that if you played with modified rules (knights move in a one-space-longer L shape, you cannot move a pawn 2 moves after it last moved, etc.), gpt-3.5 would fuckin suck.

      Edit: the author asks “why skill go down tho” on later models. Like isn’t it obvious? At that moment in time, chess skills weren’t a priority so the trillions of synthetic games weren’t included in the training? Like this isn’t that big of a mystery…? It’s not like other NNs haven’t been trained to play chess…

    • sc_griffith@awful.systems
      2 days ago

      LLMs sometimes struggle to give legal moves. In these experiments, I try 10 times and if there’s still no legal move, I just pick one at random.

      uhh
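For anyone wondering what that fallback amounts to, here is a minimal sketch of the retry-then-random policy as the quote describes it. This is not the author's code: `choose_move`, `suggest`, and the move strings are hypothetical stand-ins, with `suggest` playing the role of "ask the LLM for a move".

```python
import random
from typing import Callable, Optional, Sequence, Tuple


def choose_move(
    legal_moves: Sequence[str],
    suggest: Callable[[], str],
    max_tries: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[str, bool]:
    """Return (move, came_from_model).

    `suggest` is a hypothetical stand-in for querying the LLM. If none of
    its `max_tries` answers is legal, a uniformly random legal move is
    played instead, as the quoted writeup describes.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        move = suggest()
        if move in legal_moves:
            return move, True
    # Every attempt was illegal: the "model" did not actually choose this move.
    return rng.choice(legal_moves), False


legal = ["e4", "d4", "Nf3"]
move, from_model = choose_move(legal, lambda: "Ke9")  # always illegal
print(move, from_model)
```

The second return value is the interesting part: any game where it comes back `False` contains a move the model never picked, which muddies what the resulting win rate measures.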

    • Sailor Sega Saturn@awful.systems
      2 days ago

      Here are the results of these three models against Stockfish, a standard chess AI, on level 1, with a maximum of 0.01 seconds to make each move

      I’m not a chess person or familiar with Stockfish so take this with a grain of salt, but I found a few interesting things perusing the code / docs which I think make useful context.

      Skill Level

      I assume “level” refers to Stockfish’s Skill Level option.

      If I mathed right, Stockfish roughly estimates Skill Level 1 to be around 1445 Elo (source). However it says “This Elo rating has been calibrated at a time control of 60s+0.6s” so it may be significantly lower here.
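A rough sketch of that math, for anyone who wants to check it. The constants below are an assumption: they are recalled from the cubic Stockfish uses to turn a `UCI_Elo` value into a Skill Level, they are not from the writeup, and they may differ between Stockfish versions. The function names are made up for illustration. Inverting the cubic by bisection gives the Elo that maps to a given level.

```python
# Assumed constants, recalled from Stockfish's UCI_Elo -> Skill Level mapping.
# Not taken from the writeup; verify against the Stockfish version in use.
ELO_MIN, ELO_MAX = 1320, 3190
COEFFS = (37.2473, -40.8525, 22.2943, -0.311438)


def skill_from_elo(elo: float) -> float:
    """Unclamped Skill Level that the assumed cubic assigns to an Elo rating."""
    e = (elo - ELO_MIN) / (ELO_MAX - ELO_MIN)
    a, b, c, d = COEFFS
    return ((a * e + b) * e + c) * e + d


def elo_from_skill(level: float) -> float:
    """Invert the cubic by bisection; it is increasing over the Elo range."""
    lo, hi = float(ELO_MIN), float(ELO_MAX)
    for _ in range(60):
        mid = (lo + hi) / 2
        if skill_from_elo(mid) < level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(elo_from_skill(1)))
```

Under those assumed constants, Skill Level 1 should land in the mid-1440s, in line with the rough 1445 estimate, and the calibration caveat about time control still applies.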

      Skill Level affects the search depth (appears to use depth of 1 at Skill Level 1). It also enables MultiPV 4 to compute the four best principal variations and randomly pick from them (more randomly at lower skill levels).

      Move Time & Hardware

      This is all independent of move time. This author used a move time of 10 milliseconds (for Stockfish; no mention of how much time the LLMs got). … or at least they did if they accounted for the “Move Overhead” option defaulting to 10 milliseconds. If they left that at its default then 10ms - 10ms = 0ms so 🤷‍♀️.

      There is also no information about the hardware or number of threads they ran this on, which I feel is important information.

      Evaluation Function

      After the game was over, I calculated the score after each turn in “centipawns” where a pawn is worth 100 points, and ±1500 indicates a win or loss.

      Stockfish’s FAQ mentions that they have gone beyond centipawns for evaluating positions, because it’s strong enough that material advantage is much less relevant than it used to be. I assume it doesn’t really matter at level 1 with ~0 seconds to produce moves though.

      Still, since the author has Stockfish handy anyway, it’d be interesting to use it in its non-handicapped form to evaluate who won.