I'm @froztbyte more or less everywhere that matters

  • 33 Posts
  • 2K Comments
Joined 2 years ago
Cake day: July 2nd, 2023


  • when digging around I happened to find this thread which has some benchmarks for a diff model

    it's apples to square fenceposts, of course, since one llm is not another. but it gives something to presume from. if g4dn.2xl gave them 214 tok/s, and if we make the extremely generous presumption that tok==word (which, well, no; cf. strawberry), then any Use Deserving Of o3 (let's say 5~15k words) would mean you need a tok-rate of 1000~3000 tok/s for a "reasonable" response latency ("5-ish seconds")

    so you'd need something like 5x g4dn.2xl just to shit out 5000 words with dolphin-llama3 in "quick" time. which, again, isn't even whatever the fuck people are doing with openai's garbage.
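
    (a quick sketch of that arithmetic, for anyone who wants to poke at it; the only inputs are the 214 tok/s from the linked thread, the tok==word handwave, and the "5-ish seconds" target:)

    ```python
    # back-of-envelope: instances needed to push a whole response out in ~5s,
    # naively pretending throughput scales linearly across instances (it doesn't)
    BENCH_TOK_PER_S = 214   # per g4dn.2xlarge, from the linked dolphin-llama3 thread
    TARGET_LATENCY_S = 5    # "5-ish seconds"

    for words in (5_000, 15_000):
        needed_rate = words / TARGET_LATENCY_S       # tok/s, pretending tok == word
        instances = needed_rate / BENCH_TOK_PER_S
        print(f"{words} words: {needed_rate:.0f} tok/s, ~{instances:.1f}x g4dn.2xl")
    # 5000 words:  1000 tok/s, ~4.7x g4dn.2xl
    # 15000 words: 3000 tok/s, ~14.0x g4dn.2xl
    ```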

    utter, complete, comprehensive clownery. era-redefining clownery.

    but some dumb motherfucker in a bar will keep telling me it's the future. and I get to not boop 'em on the nose. le sigh.


  • following on from this comment, it is possible to get it turned off for a Workspace Suite Account

    1. contact support (? button from admin view)
    2. ask the first person to connect you to Workspace Support (otherwise you'll get some made-up bullshit from a person trying to buy time or Case Success or whatever, simply because they don't have the privileges to do what you're asking)
    3. tell the referred-to person that you want to enable controls for "Gemini for Google Workspace" (optionally adding that you have already disabled "Gemini App")

    hopefully you spend less time on this than the 40-something minutes I had to (a lot of which was spent watching some poor support bastard start-stop typing for minutes at a time because they didn't know how to respond to my request)



  • so, for an extremely unscientific demonstration, here (warning: AWS may try hard to get you to engage with Explainer[0]) is an instance of an aws pricing estimate for big handwave "some gpu compute"

    and when I say "extremely unscientific", I mean "I largely pulled the numbers out of my ass". even so, they're not entirely baseless, nor just picking absolute maxvals and laughing

    parameter assumptions made:

    • ā€œsomewhat beefyā€ gpu instances (g4dn.4xlarge, selected through the tried and tested ā€œsquint until it looks rightā€ method)
    • 6-day traffic pattern, excluding sunday[1]
    • daily "4h peak" total peak load profile[2]
    • 50 instances minimum, 150 maximum (let's pretend we're not openai but are instead some random fuckwit flybynight modelfuckery startup)
    • us west coast
    • spot instances, convertible spot reserves, 3y full prepay commit (yeah I know full vs partial is a big diff; once again, snore)

    (and before we get any fucking ruleslawyering dumb motherfuckers rolling in here about accuracy or whatever: get fucked kthx. this is just a very loosely demonstrative example)

    so you'd have a variable buffer of 50…150 instances, featuring 3.2…9.6TiB of RAM for working set size, 800…2400 vCPU, 50…150 nvidia t4 gpus, and 800…2400GiB gpu vram
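
    (if you want to check those fleet totals, they're just the published g4dn.4xlarge per-instance specs (16 vCPU, 64 GiB RAM, 1x nvidia t4 with 16 GiB vram) multiplied out; the 3.2…9.6TiB above is the 3200…9600 GiB figure rounded loosely:)

    ```python
    # fleet totals across the 50..150 instance autoscaling range,
    # from per-instance g4dn.4xlarge specs (16 vCPU, 64 GiB RAM, 1x T4, 16 GiB VRAM)
    VCPU, RAM_GIB, GPUS, VRAM_GIB = 16, 64, 1, 16

    for n in (50, 150):
        print(f"{n} instances: {n * RAM_GIB} GiB RAM, {n * VCPU} vCPU, "
              f"{n * GPUS} T4s, {n * VRAM_GIB} GiB VRAM")
    # 50 instances:  3200 GiB RAM,  800 vCPU,  50 T4s,  800 GiB VRAM
    # 150 instances: 9600 GiB RAM, 2400 vCPU, 150 T4s, 2400 GiB VRAM
    ```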

    let's presume a perfectly spherical ops team of uniform capability[3] and imagine that we have some lovely and capable active instance prewarming and correct host caching and whatnot. y'know, things to reduce user latency. let's pretend we're fully dynamic[4]

    so, by the numbers, then

    1y times 4h daily gives us 1460h (in seconds, that's 5256000). this extremely inaccurate full-of-presumptions number gives us "service-capable lifetime". the times your concierge is at the desk, the times you can get pizza delivered.

    x3 to get to lifetime matching our spot commit, x50…x150 to get to "total possible instance hours". which is the top end of our sunshine and rainbows pretend compute budget. which, of course, we still have exactly no idea how to spend. because we don't know the real cost of servicing a query!
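
    (spelled out, loosely; note this uses a flat 365 days/year, same as the 1460h figure above, even though the traffic-pattern assumption says 6 days a week:)

    ```python
    # "service-capable lifetime": 4h of peak service a day, over the 3y commit,
    # scaled by the 50..150 instance range to get total possible instance-seconds
    HOURS_PER_DAY, DAYS_PER_YEAR, YEARS = 4, 365, 3

    service_s_per_year = HOURS_PER_DAY * DAYS_PER_YEAR * 3600   # 1460h = 5,256,000 s
    for n in (50, 150):
        total = service_s_per_year * YEARS * n
        print(f"{n} instances x {YEARS}y: {total:,} instance-seconds")
    # 50 instances x 3y:    788,400,000 instance-seconds
    # 150 instances x 3y: 2,365,200,000 instance-seconds
    ```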

    but let's work backwards from some made-up shit, using numbers The Poor Public gets (vs numbers Free Microsoft Credits will imbue unto you), and see where we end up!

    so that means our baseline:

    • upfront cost: $4,527,400.00
    • monthly: $1460.00 (x3 x12 = $52560)
    • whatever the hell else is incurred (s3, bandwidth, …)
    • >=200k/y per ops/whatever person we have

    3y of 4h-daily at 50 instances = 788400000 seconds. at 150 instances, 2365200000 seconds.

    so we can say that, for our deeply Whiffs Ever So Slightly values, an instance-second of compute at the 50-instance end costs $0.00574252, and $0.00191417 at the 150-instance end! which gives us a bit of a handle!
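
    (same division, as code; it only uses the upfront prepay figure and ignores the monthly fees, s3, bandwidth, and ops salaries listed above:)

    ```python
    # $ per instance-second, dividing the upfront 3y prepay by the
    # total instance-seconds computed above (monthly/s3/bandwidth/ops ignored)
    UPFRONT_USD = 4_527_400.00

    for n, instance_seconds in ((50, 788_400_000), (150, 2_365_200_000)):
        print(f"{n} instances: ${UPFRONT_USD / instance_seconds:.8f} per instance-second")
    # 50 instances:  $0.00574252 per instance-second
    # 150 instances: $0.00191417 per instance-second
    ```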

    this, of course, entirely ignores parallelism, n-instance job/load/whatever distribution, database lookups, network traffic, allllllll kinds of shit. which we can't really have good information on without some insider infrastructure leaks anyway. if we pretend to look at the compute alone.

    so what does $1000/query mean, in the sense of our very ridiculous and fantastical numbers? since the units are now The Same, we can simply divide things!

    at the 50 instance mark, we'd need to hypothetically spend 174139.68 instance-seconds. that's 2.0155 days of linear compute!

    at the 150 instance mark, 522419.05 instance-seconds! 6.0465 days of linear compute!
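
    (and the last division as code, so the units stay visible; same inputs as above:)

    ```python
    # how many instance-seconds (and days of linear compute) $1000 buys
    # at the per-instance-second rates derived above
    UPFRONT_USD = 4_527_400.00
    QUERY_COST_USD = 1_000

    for n, instance_seconds in ((50, 788_400_000), (150, 2_365_200_000)):
        rate = UPFRONT_USD / instance_seconds      # $ per instance-second
        needed = QUERY_COST_USD / rate
        print(f"{n} instances: {needed:,.2f} instance-seconds "
              f"= {needed / 86_400:.4f} days of linear compute")
    # 50 instances:  ~174,139.68 instance-seconds = ~2.0155 days
    # 150 instances: ~522,419.05 instance-seconds = ~6.0465 days
    ```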

    so! what have we learned? well, we've learned that we couldn't deliver responses to prompts in Reasonable Time at these hardware presumptions! which, again, are linear presumptions. and there's gonna be a fair chunk of parallelism and other parts involved here. but even so, turns out it'd be a bit of a sizable chunk of compute allocated. to even a single prompt response.

    [0] - a product/service whose very existence I find hilarious; the entire suite of aws products is designed to extract as much money from every possible function whatsoever, leading to complexity, which they then respond to by… producing a chatbot to "guide users"

    [1] - yes yes I know, the world is not uniform and the fucking promptfans come from everywhere. I'm presuming amerocentric design thinking (which imo is probably not wrong)

    [2] - let's pretend that the calculators' presumption of 4h persistent peak load and our presumption of short-duration load approaching 4h cumulative are the same

    [3] - oh, who am I kidding, you know it's gonna be some dumb motherfuckers with ansible and k8s and terraform and chucklefuckery






  • opinion: the AWB is too afrikaans for it to be likely that that is where he picked up his nazi shit. then-era ZA still had a lot of AF/EN animosity, and in a couple of biographies of the loon you hear things like "he hated life in ZA as a kid because … {bullying}", and a non-zero amount of that may have stemmed from AF bullying EN

    (icbw, I definitely haven't studied the history of the loon's tendencies, but I can speak to (at least part[0]) of the ZA attitude)

    ([0] - I wasn't alive at the time it would've mattered to him, but other bits of the cultural attitudes lasted well into my youth)