Why most lemmy instances blocked threads

DenizEfe@lemm.ee · 1 year ago


tal · edit-2 1 year ago

Privacy - Meta has a really bad track record of user privacy. There is the worry that federating with them will result in them scraping user data from users (which IMO is a bit silly - Meta can and probably is scraping all the available public information anyway, defederating doesn’t really fix that).

Yeah, I was gonna say…as things stand, the privacy situation on the Threadiverse is in many respects weaker than on, say, Reddit. Sure, you get to choose the third-party app that may live on your phone, or the Web client, and your instance only directly pushes some data out via federation.

However, if you’re on the Threadiverse, then you have no idea what a given Threadiverse instance out there pulling in federated data is storing. You don’t know how secure your instance is, even if your instance admin has the best intentions. Unless your instance is whitelisting a very limited set of trusted instances, or isn’t federating at all and is private, treating anything you put out there as basically accessible to every organization and company is probably a good idea.

Your own instance may not retain deleted (including by mods or admins) or edited comments, but it’s a good bet that if other instances aren’t retaining them yet, some will, and they’ll permit recovering them. People were doing this on Reddit via pushshift.io.

It’s probably possible to analyze comment activity to infer roughly where someone is located, based on time-of-day activity, holidays, and so forth; there were several sites doing this for Reddit.
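To make the time-of-day point concrete, here is a minimal sketch of how little it takes. Everything in it is an assumption for illustration: the function names are made up, the input is a list of timezone-aware comment timestamps, and the heuristic simply treats the quietest 8-hour stretch as sleep starting around 23:00 local time. Real tools would be fuzzier and would use more signals.

```python
from collections import Counter
from datetime import datetime, timezone

def hourly_histogram(timestamps):
    """Count comments per UTC hour of day (0-23). Expects timezone-aware datetimes."""
    counts = Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]

def quietest_window_start(hist, window=8):
    """Return the UTC hour at which the least-active `window`-hour stretch begins."""
    def activity(start):
        return sum(hist[(start + i) % 24] for i in range(window))
    return min(range(24), key=activity)

def guess_utc_offset(timestamps, assumed_sleep_start=23, window=8):
    """Guess a UTC offset in hours.

    Assumption (not a fact about any user): the quietest stretch is sleep
    that begins at `assumed_sleep_start` local time.
    """
    start_utc = quietest_window_start(hourly_histogram(timestamps), window)
    offset = (assumed_sleep_start - start_utc) % 24
    # Fold into the range -11..+12 so that, e.g., 19 becomes -5.
    return offset - 24 if offset > 12 else offset
```

Feed it a few hundred public comment timestamps and you get a rough guess at a time zone, which already narrows things down a lot.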

And it’s probably not that hard to obtain a user’s IP address, so either you want to be okay with what you’re posting maybe being linked to your IP, or you want to avoid having a persistent IP, say by using a VPN. It’s probably possible for someone to at least roughly geolocate an IP. It might also be possible to correlate it with other logs; if someone, for example, has access to someone’s Steam login history and can link that to an identity, and can link both to an IP address at different times, they can probably deanonymize a user.

There are also text classifiers that can run on comments and extract things like someone’s likely gender, or anything else for which you’ve trained a statistical text classifier on a large-enough corpus. You can probably get at least approximate age, and I’ve seen classifiers that aim at identifying roughly where someone lives. Some famous examples of deanonymization via text:

Robert Hanssen, a very serious mole in the FBI, was caught after he used the phrase “the purple-pissing Japanese”, which was a quote from General George Patton, in an anonymous context, and someone had heard him use it once before (not a computer, just humans managed to pull this off). It’s probably possible to cross-correlate unusual phrases across identities; it doesn’t take many to form a unique signature.
The Federalist Papers were an important set of documents written under the pseudonym “Publius” by several major Founding Fathers in the US – Alexander Hamilton, James Madison, and John Jay. They argued for the ratification of the US Constitution. Nearly two centuries later, computer-based Bayesian statistical analysis became practical, and it became possible to attribute the articles whose authorship was disputed: train a classifier on the authors’ known works, then run it on their anonymous works, and get an estimate, with a confidence level, as to the identity of the author. That was pretty nifty from a historian’s standpoint, but it’s worth considering that the same technique is also viable today to deanonymize people.
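For a feel of the "train on known works, run on anonymous works" idea, here is a toy sketch of a naive Bayes classifier over function-word counts. It is not the historians' actual method or word list; the word list, function names, and add-one smoothing are all assumptions picked to keep the example short, and it assumes every author is equally likely up front.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny function-word list; a serious study would use many more.
FUNCTION_WORDS = ["the", "of", "to", "and", "in", "by",
                  "upon", "while", "whilst", "on", "there", "an"]

def word_counts(text):
    """Count only the function words; topic words are ignored on purpose."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in FUNCTION_WORDS)

def train(known_texts):
    """known_texts: {author: [text, ...]} -> {author: {word: log-probability}}."""
    model = {}
    for author, texts in known_texts.items():
        counts = Counter()
        for text in texts:
            counts.update(word_counts(text))
        total = sum(counts.values()) + len(FUNCTION_WORDS)  # add-one smoothing
        model[author] = {w: math.log((counts[w] + 1) / total)
                         for w in FUNCTION_WORDS}
    return model

def attribute(model, text):
    """Return {author: posterior probability}, assuming a uniform prior."""
    counts = word_counts(text)
    scores = {author: sum(n * logp[w] for w, n in counts.items())
              for author, logp in model.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {author: math.exp(s - top) for author, s in scores.items()}
    norm = sum(weights.values())
    return {author: w / norm for author, w in weights.items()}
```

Usage would be something like `attribute(train({"alice": [...], "bob": [...]}), some_anonymous_comment)`; with years of someone's public comments as training data, the estimates get a lot less toy-like.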

With Reddit or similar, Reddit’s probably gonna data-mine what they can and may sell it to some parties, but they also probably won’t be directly feeding it to random unsavory people, though it may wind up in their hands.

There are probably a couple of good ways that lemmy/kbin could legitimately improve privacy.

I don’t know what the logging situation is today, but having the option for an admin to bound log retention time might be a good idea; retaining enough for abuse and debugging, but not leaving a lot of data around in case someone breaks in and swipes 'em. You still need to trust your instance admin, and the lemmy/kbin software, but at least it’s possible for an admin to bound what gets swiped if someone breaks in.
Not allowing remote images in comments, which is presently permitted; as I point out above, that’s going to let user IP addresses be extracted by parties other than their instance. At least give the user the option to block them, and have home instances maybe have an option to cache them and serve them locally…that’ll create its own storage and bandwidth concerns, but one can at least imagine heuristics to deal with that.
Having some form of public/private key authentication – like, I can upload a pubkey to an account – to permit someone to prove that they are who they say they are in the event of later instance compromise.
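On the remote-image item in the list above, a rough sketch of what "have the home instance serve them" could look like at the text level: rewrite Markdown image links in a comment so the browser fetches them through the user's own instance instead of hitting a third-party host directly. The proxy path here is hypothetical and not an actual lemmy/kbin endpoint, the function name is made up, and the regex only handles simple inline `![alt](url)` images.

```python
import re
from urllib.parse import quote, urlsplit

# Matches simple Markdown inline images: ![alt](url) or ![alt](url "title").
IMAGE_RE = re.compile(r'!\[([^\]]*)\]\(\s*([^)\s]+)([^)]*)\)')

def proxy_remote_images(markdown, home_host,
                        proxy_path="/hypothetical/image_proxy"):
    """Point remote image URLs at a (hypothetical) proxy on the home instance."""
    def rewrite(match):
        alt, url, rest = match.groups()
        host = urlsplit(url).hostname
        if host is None or host == home_host:
            return match.group(0)  # relative or already-local image: leave alone
        proxied = f"https://{home_host}{proxy_path}?url={quote(url, safe='')}"
        return f"![{alt}]({proxied}{rest})"
    return IMAGE_RE.sub(rewrite, markdown)
```

With that in place, the remote host only ever sees the instance's IP, not the reader's; the storage and bandwidth cost moves to the instance, as noted above.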