[Federated Learning] Privacy-preserving recommender system for Peertube?

Hi folks!
I am Marc, a PhD student working on privacy-preserving machine learning (with a slight focus on federated learning). Being a Fediverse enthusiast, I am looking for applications of my research in the Fediverse. My goal is to bring ML to the Fediverse. Don’t get me wrong: I don’t want to force ML features into the Fediverse, only to cover genuinely needed ones (if any exist). In this context, Peertube might be a good fit if you want to build a privacy-preserving recommender system: such a system would be collaboratively trained by all participating instances (i.e., all instances accepting the recommendation extension). No private data would be revealed (each instance keeps its data locally), and each instance might even have a slightly personalized model (depending on the particular interests of its community).

This might already be too much detail, since my first question for you is: « Would the Peertube project/developers be interested in such an automated recommender system? » I saw a relatively old open issue about recommendations on GitHub, but I would like to know whether informal discussions have happened since then. Maybe you even have some developers working on it?

NB: I first posted this message on Matrix and later discovered this forum, which is clearly a better place to discuss such a complex feature.


Hi!

What kind of privacy-preserving recommendations would this feature give? Is it for users or for admins?

Hi,
This can be defined depending on your needs. The first step would be private training: the recommendation model is trained collaboratively by the instances, each keeping its data private. They only exchange machine-learning model updates in order to converge towards a common optimal model.
Once we have a good model trained over all the federation’s data, we can do « secure inference », where the user sends an encrypted query and receives encrypted recommendations (which are decrypted locally). However, this raises even more technical issues (still very interesting ones).
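To make the training step concrete, here is a deliberately tiny sketch of federated averaging. Everything in it is hypothetical (a one-weight linear model, two hard-coded « instances », no secure aggregation): the point is only that model parameters travel, never the data.

```javascript
// Toy federated averaging: each instance fits y ≈ w * x on its own data
// and only the resulting weight (never the data) is shared and averaged.

// Hypothetical local step: one pass of gradient descent on private data.
function localUpdate(globalModel, privateData, lr = 0.1) {
  let w = globalModel.w
  for (const { x, y } of privateData) {
    w -= lr * (w * x - y) * x // gradient of (w*x - y)^2 / 2
  }
  return { w }
}

// One federated round: average the locally updated models.
function federatedAverage(models) {
  return { w: models.reduce((sum, m) => sum + m.w, 0) / models.length }
}

// Two "instances" whose private datasets both follow y = 2x.
const instanceA = [{ x: 1, y: 2 }, { x: 2, y: 4 }]
const instanceB = [{ x: 3, y: 6 }]

let model = { w: 0 }
for (let round = 0; round < 50; round++) {
  model = federatedAverage([
    localUpdate(model, instanceA),
    localUpdate(model, instanceB),
  ])
}
// model.w converges towards 2, although neither dataset ever left its instance
```

In a real deployment the « model » would be the recommender’s parameters, exchanged over the federation protocol with secure aggregation and noise on top.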

I would suggest starting with the following assumption: « Each user trusts their Peertube instance (= its admins) to store and process their preferences. » However, the user has no trust in the rest of the instances, or even in any other user in the world. In this case, we would do the private training, but the recommendation itself would be done in plaintext (between the user and their instance). Each instance would know the preferences of its own users.

I hope this clarifies my idea a bit; otherwise, I can give more details. By the way, this is just one possible setup. My goal is to discuss your expectations in terms of recommendations and privacy, so that I can design a system satisfying exactly your needs. We can then discuss: what data is used? Where is it stored? Who is allowed to access it? Who agrees to contribute to the training computations? Etc.


By the way, I haven’t said it explicitly yet, but I would develop the feature myself and don’t expect the main developers to build it. My message is more a feature proposal than a feature request. However, I first want to know whether it fits the Peertube philosophy, because I will need some discussions with Peertube developers to set up a satisfying architecture for this recommender. For now, I know Peertube only as a user, neither as an admin nor as a developer.

I prefer more concrete and auditable recommendation systems, honestly: something based on thumbs, watch time, and tags (and those of friends and the Fediverse in general) that could be read and understood by a competent coder.

Where I would love to see some machine learning is in optimizing the redundancy system and livestream network structure negotiation to optimize speed and efficiency.

I prefer more concrete and auditable recommendation systems, honestly: something based on thumbs, watch time, and tags (and those of friends and the Fediverse in general) that could be read and understood by a competent coder.

I am not sure I fully understand your point. I don’t know whether you would prefer a fully deterministic recommender with no ML, or whether you could accept an explainable ML model. To be honest, I would not go for a black box such as a neural network, because Peertube instances don’t have enough data anyway (at least for now). Most likely, I would start with simple and interpretable models (e.g., tree models). Moreover, I have no commitment yet, but I assume the model would take watch time, thumbs, etc. as input to identify similar profiles and suggest new videos. Anyway, as powerful as ML can be, I hear the concerns about the filter bubbles created by recommender systems (even if some recommendation techniques manage to increase diversity via ML).
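For a sense of how auditable such a baseline could be, here is a hypothetical scoring rule (made-up field names, not a committed design): rank candidate videos by their tag overlap with the user’s watch history, weighted by watch time. Every score can be recomputed by hand.

```javascript
// Jaccard similarity between two tag lists: |intersection| / |union|.
function jaccard(a, b) {
  const inter = a.filter(tag => b.includes(tag)).length
  const union = new Set([...a, ...b]).size
  return union === 0 ? 0 : inter / union
}

// history: [{ tags, watchTime }] — candidates: [{ id, tags }]
// A candidate's score is its watch-time-weighted tag overlap with the history.
function recommend(history, candidates) {
  return candidates
    .map(video => ({
      id: video.id,
      score: history.reduce(
        (sum, seen) => sum + seen.watchTime * jaccard(seen.tags, video.tags),
        0,
      ),
    }))
    .sort((a, b) => b.score - a.score)
}

const history = [{ tags: ['linux', 'foss'], watchTime: 300 }]
const candidates = [
  { id: 1, tags: ['linux', 'kernel'] },
  { id: 2, tags: ['cooking'] },
]
const ranked = recommend(history, candidates)
// the linux video ranks first, thanks to its one shared tag
```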

Where I would love to see some machine learning is in optimizing the redundancy system and livestream network structure negotiation to optimize speed and efficiency.

I had the intuition that something could be done there, but no concrete ideas, so I would love to discuss them if you have further insights!

Ah, thank you for the clarification, that sounds much better than my misapprehension. The other related videos and picks for autoplay could definitely use some improvement.

Regarding the redundancy issues: currently, it seems to just randomly determine which instance to grab the next file chunk from. A system that learns which hosts provide better latency, higher bandwidth, and fewer errors for an individual client could optimize performance and cut down on buffering issues. I admit to being pretty clueless about ML, but it seems like a problem where a myriad of factors specific to each viewer and host affect performance, yet over time patterns would emerge that could be useful for avoiding buffering issues and improving the user experience.
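This is close to a classic multi-armed-bandit setup. As a sketch (hypothetical API, not Peertube’s actual chunk scheduler): pick the mirror with the best observed throughput most of the time, but keep occasionally probing the others so the estimates stay fresh.

```javascript
// Epsilon-greedy mirror selection: exploit the best-known host, explore ~10%.
function makeSelector(hosts, epsilon = 0.1) {
  const stats = new Map(hosts.map(host => [host, { mean: 0, n: 0 }]))
  return {
    pick() {
      if (Math.random() < epsilon) {
        // explore: random host, so a currently slow mirror can redeem itself
        return hosts[Math.floor(Math.random() * hosts.length)]
      }
      // exploit: host with the highest average observed throughput so far
      return hosts.reduce((best, host) =>
        stats.get(host).mean > stats.get(best).mean ? host : best)
    },
    // After each chunk, report the measured throughput (e.g. bytes/second).
    report(host, throughput) {
      const s = stats.get(host)
      s.n += 1
      s.mean += (throughput - s.mean) / s.n // incremental running average
    },
  }
}

// Usage sketch: const selector = makeSelector(['mirror-a', 'mirror-b'])
// const host = selector.pick(); …fetch a chunk…; selector.report(host, bytesPerSec)
```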

I would be interested in it.

I started to collect tags on my instance to visualize a connected view of my content and thus, hopefully, facilitate recommendation:

// List up to 100 videos, then fetch each one's details to print its tags.
const base = 'https://video.benetou.fr/api/v1/videos'

fetch(base + '?count=100')
  .then(response => response.json())
  .then(listing => {
    console.log(listing.data)
    listing.data.forEach(video => {
      fetch(base + '/' + video.id)
        .then(response => (response.ok ? response.json() : null))
        .then(details => {
          if (details && details.tags && details.tags.length > 0) {
            console.log(details.name, details.id, details.tags)
          }
        })
    })
  })

but this mostly shows that there are not enough tags and content for it to be very useful for now.

Hopefully, though, seeing that gap will motivate me to tag more.

Anyway, the point is that this method is not going to scale, whereas a numerical basis would. This prompts me to ask, though: at what scale does this kind of method become usable? Do you need data from thousands of users over thousands of viewings?

Finally, on a more direct technical note, I would suggest asking whether it already exists (as you did), but then, unless you can actually contribute to existing work, building your own plugin right away. PeerTube has an extensible architecture (see the PeerTube documentation), so you can go much deeper than « just » changing visuals on the frontend.

This prompts me to ask, though: at what scale does this kind of method become usable? Do you need data from thousands of users over thousands of viewings?

This is a tough question to answer. I’ll give four elements of response:

  1. I must admit that, while there is a lot of research on ML-based recommender systems, very few open-source recommender systems are used on a daily basis. Researchers use clean, static datasets, while real-life recommender systems deal with noisy, evolving data. It is therefore hard to extrapolate from these research results.
  2. We are in a totally new setup: federation. This should require a bit more data than the centralised case, but if we manage to include many instances in the training process, it should be fine. To start, I would like a cumulative (registered) user base of a few thousand users, and we’ll see how it goes.
  3. It depends on the model and the data I try to use. Peertube provides several types of data (e.g., tags, views, likes), and I will need to choose which ones to use in training. The complexity of the ML model also has a major impact on scaling. Obviously, I’ll start modestly, with simple data and a simple model.
  4. The Peertube network is very heterogeneous: some instances are specialised, some users may have several accounts (e.g., one for each area of interest), etc. This complicates the learning task.

To sum up, this is one of the open research questions. We may have a rough start, but it seems necessary to create this experimental field. Hopefully, it will draw attention and gather a sufficient user base. I don’t expect millions of users, but a few thousand would be a really nice start.

I’ll have to discuss all these points with users/admins/maintainers of Peertube, as well as with researchers, to solve the concrete problems we face. My current goal with this thread is to identify the foundations on which we can build a recommender system. Once I have a clearer view of the specifications, I’ll contact some colleagues to start solving the problems.

Finally, on a more direct technical note, I would suggest asking whether it already exists (as you did), but then, unless you can actually contribute to existing work, building your own plugin right away.

Thanks for the suggestion; I’ll most likely do that once the technical details are settled.

I’ll be honest: this is a medium-term prospect. All these discussions aim at preparing the research work I may concretely start in a year. Depending on the interest of my colleagues (or even of the Peertube community), I may speed things up.

PS: Sorry, I keep sending long messages, but I try to be as precise as possible about my intentions :slightly_smiling_face:


The federation aspect is indeed totally different but very valuable.

This isn’t an ML or federated-ML example, but I would argue that https://github.com/crowdsecurity/crowdsec is a great example of a project whose community must be convinced of the value of their input and of how safe contributing is for them.

I, as a PeerTube admin, would be fine using your plugin to extract data and get some suggestions back, but I would have to know exactly what data would be used (with a prior review of the plugin’s source) and that the resulting model would remain usable and open without you.


Thanks, I’ll take some inspiration from CrowdSec to build a project capable of convincing open-source communities.

I, as a PeerTube admin, would be fine using your plugin to extract data and get some suggestions back, but I would have to know exactly what data would be used (with a prior review of the plugin’s source) and that the resulting model would remain usable and open without you.

All of this makes total sense to me. Just so you know, with federated ML you keep total control over your data (the personal data of your users never leaves your instance) and over the ML model. The training process is nearly P2P, so there is no central authority controlling the ML model. Everything will be open source (code and model). At the end of the federated training, each instance has a copy of the collaboratively trained ML model, and inference is done directly on the instance. Contrary to most modern ML systems, there is no central API storing the model and returning inferences.

However, I’ll still have some work to do to convince you (admins and users) that privacy is preserved during the training process (i.e., that the public model leaks no personal information about the users). ML researchers have theoretical results about this, but they are not understandable by the average human being. I’ll therefore make a popularization effort, so that everyone can understand and trust this recommender system.

Once the work has started, I’ll keep the Peertube community updated (via this forum) on my progress.


Makes sense to me. I’d also advise partnering with a trusted organization like the EFF, AccessNow, Exodus Privacy, etc., to see whether they would audit your code directly or recommend someone who could.

It is one thing to provide an explanation, but it’s another to believe it. If it comes from a single entity with a bias in the result (namely you, with a motivation for adoption), then it is not as reliable, IMHO, as if it were verified by a trusted third party.


A code audit is a nice idea, but it would be a long-term goal.

Right now, my biggest concern is not convincing people that the algorithms are securely implemented, but convincing users of the algorithms themselves. In privacy-preserving ML, to preserve privacy in theory, you just need to add a little bit of carefully calibrated noise to your data. To some extent, it seems a bit magical to someone who doesn’t know the maths behind it.
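To make the « a little bit of noise » part less magical, here is a sketch of the standard Laplace mechanism from differential privacy (the numbers are made up; a real deployment would have to budget epsilon carefully):

```javascript
// Sample from a Laplace(0, scale) distribution via inverse transform sampling.
function laplaceNoise(scale) {
  const u = Math.random() - 0.5
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u))
}

// A counting query has sensitivity 1 (adding or removing one user changes the
// count by at most 1), so noise of scale 1/epsilon yields epsilon-differential
// privacy: smaller epsilon, noisier answer, stronger guarantee.
function privateCount(trueCount, epsilon) {
  return trueCount + laplaceNoise(1 / epsilon)
}

// e.g. publish roughly how many users watched a video, with epsilon = 0.5
const reported = privateCount(1234, 0.5)
// `reported` is close to 1234 on average, yet any single viewer can
// plausibly deny having contributed to it
```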

I know that someone from my research group is working with the CNIL (the French authority ensuring the application of data-privacy laws) to understand how these new privacy-preserving technologies can fit into the GDPR.

The CNIL might therefore be an interesting third party for this project, but a technical audit would still be needed at a later stage. Anyway, everything will be open source, so that anyone can understand how the data is processed.

I think there is an important point: each instance’s admin should be able to choose which instances it interacts with.

I’ll explain why.

  1. There are many American alt-right instances. I don’t want my instance to learn suggestions from them.
  2. Using data from just any instance would be an open door to spam: someone could inject false data (from fake instances) to influence the results.

So, as an admin, I should be able to choose valid sources for your algorithms.
More: I should be able to revert some of the learning if I detect an unwanted source.

And another point, just to keep in mind: views of private videos should not be taken into account in your training, as they could leak some data (private tags, etc.).


I think there is an important point: each instance’s admin should be able to choose which instances it interacts with.

That’s an excellent remark, because I had this question in mind. To be honest, I’ll probably try two kinds of architecture: semi-decentralized and fully decentralized.

  • In the semi-decentralized one, all instances contribute to the same model (hence, even the alt-right ones). It should produce more accurate models in an honest scenario… but the real world is not that perfect. However, it could be a nice starting point for experiments.
  • In the fully decentralized one, an instance only communicates with the instances it follows. This creates peer-to-peer learning (requiring a bit more data to converge), where each instance ends up with a personalized recommendation model.
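The fully decentralized case can be pictured as gossip averaging over the follow graph. A minimal sketch (hypothetical three-instance topology, and a one-number « model »; real models have many parameters):

```javascript
// One gossip round: every instance replaces its model with the average of its
// own model and those of the instances it follows. No central server involved.
function gossipRound(models, follows) {
  const next = {}
  for (const [instance, peers] of Object.entries(follows)) {
    const neighbourhood = [instance, ...peers]
    next[instance] =
      neighbourhood.reduce((sum, p) => sum + models[p], 0) / neighbourhood.length
  }
  return next
}

// Three instances following each other in a line: a — b — c
const follows = { a: ['b'], b: ['a', 'c'], c: ['b'] }
let models = { a: 0, b: 3, c: 6 }
for (let round = 0; round < 100; round++) {
  models = gossipRound(models, follows)
}
// the three values drift towards a common consensus, even though
// `a` and `c` never exchange anything directly
```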

Later, I may imagine intermediate scenarios: semi-decentralized, but where it is possible to filter out some of the contributions.

More: I should be able to revert some of the learning if I detect an unwanted source.

I’ll note that point down for future investigation, because it concerns a problem about which I have very little knowledge: machine unlearning. Anyway, it is very interesting.

And another point, just to keep in mind: views of private videos should not be taken into account in your training, as they could leak some data (private tags, etc.).

I agree with that: only videos that can be recommended (i.e., public videos) are part of the training.

A small parenthesis: I think « machine unlearning » is a very important topic nowadays. Think of the GDPR, for example: anyone can ask a company to delete their personal data. But what about data that was generated from your personal data? Can a commercial company keep your computed profile?!
I don’t know whether such a thing exists, but it definitely should.

That’s an excellent point. It is an open scientific and legal question. During the development, I’ll try (with the help of some colleagues) to provide satisfying answers to these questions!
