Best way to upload 30k videos?

Chocobozzz · January 16, 2024, 12:51

Can you show your code? And what is your instance URL (or at least its version)?

GregRc · January 16, 2024, 1:26

It’s OK now. It seems the doc does not mention that, unlike a plain api/v1/videos/{id} call, api/v1/videos/{id}/source must be authorized, so the request needs an Authorization: Bearer header.
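For readers hitting the same issue, here is a minimal sketch of what such an authorized call could look like. The helper name buildSourceRequest and the way the access token is obtained are assumptions for illustration; only the endpoint path and the Bearer header come from the discussion above.

```javascript
// Hypothetical helper: builds the URL and fetch options for the
// api/v1/videos/{id}/source endpoint, which requires authorization.
function buildSourceRequest(baseUrl, videoId, accessToken) {
  const url = new URL(`api/v1/videos/${encodeURIComponent(videoId)}/source`, baseUrl);
  return {
    url: url.toString(),
    options: {
      method: "GET",
      // Without this header the call is rejected, unlike api/v1/videos/{id}.
      headers: { Authorization: `Bearer ${accessToken}` },
    },
  };
}

// Possible usage with the global fetch of recent Node.js versions
// (accessToken is assumed to come from a previous login step):
//   const { url, options } = buildSourceRequest("http://127.0.0.1:9001/", 155, accessToken);
//   const response = await fetch(url, options);
```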

JohnLivingston · January 16, 2024, 1:42

Indeed, this info is not public, and only viewable by the video’s owner

Great work.

Chocobozzz · January 16, 2024, 1:54

Thanks! I’ll update the documentation

GregRc · January 16, 2024, 1:54

This turned out to be rather useless anyway, since I hadn’t first realized: « OK, I can get the filename of a video I know, but I can’t get a video from its filename ». Getting ALL videos and then ALL their sources through the API, just to find which one matches a given filename, feels ridiculous. I guess I either have to query the videoSource table (and learn PostgreSQL first, but it shouldn’t be far from MySQL), or just stick with the XLS file.

GregRc · January 16, 2024, 2:04

Unless of course @Chocobozzz feels like adding a query to the API returning a video by its source, or the list of sources

Chocobozzz · January 16, 2024, 2:06

I’m sorry but it’s not planned, Framasoft doesn’t have time to implement this feature

GregRc · January 16, 2024, 3:42

No problem, getting the info straight from the database went more smoothly than I thought.

EricG · January 18, 2024, 3:51

Just a little teasing of the great job of GregRc with Chocobozzz’s and JohnLivingston’s help… Still in progress

GregRc · February 22, 2024, 2:24

Hi, my mass upload script mostly works, but I’m getting some upload errors I can’t figure out: usually I just restart the script a bit later and the failed uploads then succeed, with no change to settings or code. The errors are of these types:

Error: write ECONNRESET
Error: write EPIPE
ENOSPC: no space left on device, write
SocketError: other side closed

I also get SyntaxError: Unexpected token '<', "<html><h"... is not valid JSON errors, obviously thrown because HTML is sent instead of JSON, but I don’t know why or what’s in it.
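One way to find out what that HTML contains is to read the response as text before parsing it, and to keep an excerpt when parsing fails. The sketch below assumes a hypothetical helper named parseJsonResponse; it is not part of the script discussed here. An HTML body like this often comes from an error page generated by a reverse proxy rather than by Peertube itself, but that is only a guess until the excerpt has been inspected.

```javascript
// Hypothetical helper: tries to parse a response body as JSON and, when that
// fails, returns a short diagnostic instead of throwing a SyntaxError.
function parseJsonResponse(status, contentType, bodyText) {
  try {
    return { ok: true, data: JSON.parse(bodyText) };
  } catch (error) {
    // Collapse whitespace and keep only the beginning of the body for the logs.
    const excerpt = bodyText.replace(/\s+/g, " ").trim().slice(0, 200);
    return { ok: false, status, contentType: contentType || "(none)", excerpt };
  }
}

// Possible usage after a fetch call:
//   const response = await fetch(url, options);
//   const parsed = parseJsonResponse(
//     response.status,
//     response.headers.get("content-type"),
//     await response.text()
//   );
//   if (!parsed.ok) console.error("Non-JSON response:", parsed);
```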

It feels like some kind of traffic jam is happening. There are 101 videos involved, each a few minutes long and in the 300-700 MB size range. At first, 53 uploads went nicely and 22 failed (the script ignored the remaining 26 as they were already online). A few minutes later those 22 kept failing, and a few days later they all succeeded. This happens on my local Peertube with local files, as well as when the script runs on a CDN to send files from there to a remote test Peertube server. I’m still trying to figure out what’s going on with the errors above, but I’m mentioning it here in case it rings a bell and someone has a clue.

edit: the full SocketError is
Peertube API callSync error "http://127.0.0.1:9001/api/v1/videos/upload": TypeError: fetch failed at Object.fetch (node:internal/deps/undici/undici:11457:11) after I fetch Peertube « upload » request.

JohnLivingston · February 22, 2024, 4:06

Seems there is no space left. You just have to figure out where. Maybe on the partition where nginx stores incoming uploads?

vid-bin · February 22, 2024, 9:47

You have to make sure you have space left. Specifically you want to look for temporary upload locations. Nginx has one for example.

GregRc · February 26, 2024, 8:46

I don’t think disk space is the issue: our Peertube server has dozens of gigabytes left, plus the ENOSPC is thrown on only 2 uploads out of 101 and the script goes on to upload other videos with no problem. E.g. the ENOSPC is thrown after calling the videos/upload API with a 186 MB mp4 file, while larger videos get uploaded a few steps later with no problem. I suppose that if space were the problem, it would also break the following uploads.

Edit: it might indeed be a space problem, as another upload on a different server was monitored by admins who saw the free space vanishing, and then many of the above errors legitimately started appearing. There’s a script that frees space once a video has been transcoded, but uploading went much faster than transcoding, which would explain why another try a few days later could succeed.

sanjeevmansotra · February 27, 2024, 5:30

Hello thanks for sharing this information, I Sanjeev Mansotra was looking at it for a long time. Thank you for the hack!

EricG · March 20, 2024, 10:33

Hi, here is a quick update.
We’re now writing the doc for this peertube-mass-upload script, and we hope to release it in the next 2 or 3 months. Once done, we will publish the link to it here.
Thanks

Ash3T · April 3, 2024, 6:48

Yeah! Looking forward to checking it out.

Faustina · April 28, 2024, 4:43

Dear Eric

Any chance to test the peertube-mass-upload script? Maybe I can be your tester and give you feedback.

Thank you very much for your effort Eric …awesome

GregRc · April 29, 2024, 9:42

Hi,

I plan to release PMU (which no longer means Pari Mutuel Urbain but Peertube Mass Uploader, sorry my French folks) as soon as possible, but my coworkers and I need to figure out how to publish it on GitHub so people can use it and contribute, while still allowing us to work on it and deploy it. Obviously we don’t want to make public a project with our own settings.yaml containing server IPs and user passwords. Currently PMU lives on our own restricted, VPN-backed GitLab server, and this sensitive data is set in GitLab variables so that PMU can easily be deployed through CI with every change to its main branch. But then there are a .gitlab-ci.yml and an exclude-rsync.txt that still need to be versioned and restricted to my organization, or that at the very least have no place in a public Git repository, as they would make no sense to other users. Maybe we’ll need to have PMU both on GitHub and on our GitLab, I don’t know yet; it’s a first for me and my organization. I have a meeting about that tomorrow and I’d gladly hear your views on how to handle this.

For now, here’s the readme.md so you can learn more. There’s also a French version I can share if you’d like (I couldn’t attach them as files). It’s a rather long Markdown document, so you may want to check the first chapter and the last two first. As stated in « Suggested improvements », the tool only handles XLSX files for now, which may be a problem for other users, but it suits our needs as this file comes from another tool, and PMU then adds Peertube information to it after each upload.

Enough yacking, here’s the readme (which makes quite a huge post, sorry for that):

What is it?

Peertube Mass Uploader (PMU) is a Node.js script allowing bulk upload of videos to Peertube from MP4 files, posters (JPG, PNG, WebP), subtitles (SRT, VTT), and an XLSX data file. Once videos are uploaded, their Peertube data can be sent back into the XLSX file, like their URL, ID, short UUID, UUID, and upload date.

How does it work?

PMU scans a folder for mp4 files. Based on the name of the first mp4 file found, it retrieves its associated files (poster, subtitle) and its data in the XLSX file. It also uses this file name to check if the video is indeed absent from Peertube. If so, it uploads the video to Peertube with all the retrieved data and, if desired, sends Peertube information about the video back to the XLSX. Then it moves on to the next mp4 file.

Requirements

Node.js (v20.11.1)
npm (v9)
Required modules installed using:

npm install

Configuration is done via two files:

An .xlsx file containing video metadata (title, description, channel, etc.) where the URL and IDs of each video can be injected after upload.
A settings.yaml file defining script usage parameters.

Additionally, some arguments can be defined at script launch.

XLSX file

This file contains data for each video, such as its title or description, with one video per row and one piece of data per column.
The order of these columns is arbitrary but needs to be indicated in data:in: in settings.yaml.

All these columns contain text as it appears in the front end, not IDs. For example, if you want to add a video to the channel named « Concert Captures » with the ID « concert_captures_1 », its channel column should contain « Concert Captures » (or « concert captures » because case is ignored).

For more information on the content of these columns, see data:in:.

settings.yaml file

Environments

Each key at the root level of this file defines an environment. There must be at least default:, which is the environment used if none is specified when launching the script with the --env argument. Other environments can be created by giving them any name and placing the parameters that will override the default ones, as values from default are retrieved first regardless of the environment used.

Use the default: environment for generic values, and create other environments to put only specific values there.
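The override mechanism can be pictured with the sketch below. It assumes settings.yaml has already been parsed into a plain object and that overrides are merged key by key at every nesting level; whether PMU merges deeply or only at the first level is not stated here, so treat deepMerge and resolveEnvironment as hypothetical names and behavior.

```javascript
// Returns true for plain (non-array, non-null) objects.
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Recursively overlays `override` on top of `base` without mutating either.
function deepMerge(base, override) {
  const result = { ...base };
  for (const [key, value] of Object.entries(override)) {
    result[key] =
      isPlainObject(value) && isPlainObject(base[key]) ? deepMerge(base[key], value) : value;
  }
  return result;
}

// Values from default: are taken first, then overridden by the chosen environment.
function resolveEnvironment(settings, envName = "default") {
  if (!isPlainObject(settings.default)) {
    throw new Error("settings.yaml must define a default: environment");
  }
  if (envName === "default") return settings.default;
  if (!isPlainObject(settings[envName])) {
    throw new Error(`Unknown environment: ${envName}`);
  }
  return deepMerge(settings.default, settings[envName]);
}

// Example object, shaped like a parsed settings.yaml (values are illustrative).
const exampleSettings = {
  default: { misc: { baseurl: "http://127.0.0.1:9001/", uploadPauseMs: 500 }, user: { name: "root" } },
  prod: { misc: { baseurl: "https://peertube.example/" } },
};
console.log(resolveEnvironment(exampleSettings, "prod"));
```

With this approach the prod environment only needs to list what differs from default, such as the base URL.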

Parameters

To be placed in environments. All parameters are required unless otherwise stated.

`misc:`

baseurl: base URL of Peertube, e.g., http://127.0.0.1:9001/. The API will then be called at *http://127.0.0.1:9001/api/v1/*.
uploadPauseMs: (optional) pause time in ms after each upload (none by default).
limit: (optional) limits processing to the first N MP4 files in the paths:todo: folder, in alphabetical order. By default, all are processed. The value 0 cancels this limit, and a negative value -N processes all MP4 files except the last N. Can be overridden when launching the script with the --limit argument.
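The limit semantics described above happen to match what Array.prototype.slice does with positive and negative end indexes. The sketch below illustrates them; the function name applyLimit and the use of the default sort() as "alphabetical order" are assumptions.

```javascript
// Hypothetical helper: selects which MP4 files to process for a given limit.
function applyLimit(mp4Files, limit) {
  // Default sort() compares UTF-16 code units; the real tool may sort differently.
  const sorted = [...mp4Files].sort();
  // No limit set, or 0: process everything.
  if (limit === undefined || limit === null || limit === 0) return sorted;
  // Positive N: first N files. Negative -N: all files except the last N.
  return sorted.slice(0, limit);
}

const files = ["c.mp4", "a.mp4", "d.mp4", "b.mp4"];
console.log(applyLimit(files, 2));
console.log(applyLimit(files, 0));
console.log(applyLimit(files, -1));
```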

`user:`

name: Peertube account name to use.
pwd: Password of this account.

Example for default local environment:

user:
  name: root
  pwd: test

`db:`

Parameters for accessing the Peertube PostgreSQL database. Default example for local:

db:
  host: 127.0.0.1
  port: 5432
  database: peertube_dev
  user: postgres
  password: postgres

`paths:`

todo: folder where the files to upload are located: MP4, posters, and subtitles (subfolders are ignored).
wip: folder where the tool will move each MP4 file before processing.
done: folder where the tool will move files once successfully uploaded.
doubles: folder where the tool will move already existing videos.
failed: folder where the tool will move files in case of failure.

When an MP4 file is moved, its related and used files (subtitles, poster) are moved with it.

`files:`

`files:identifierRegex:`

Case-sensitive regular expression applied to each MP4 file name and used to build the video identifier in the XLSX file (see data:identifierValue:), to know if it already exists in Peertube (see data:dbCheck:), and to retrieve its poster and subtitle files (see below).

`files:posters:`

List of files that can be used as posters, in order of preference. The first existing file will be used. Each item in the list is applied to files:identifierRegex: to build a file name. For example:

posters:
  - $1.jpg
  - $1.jpeg
  - $1.webp
  - $1.png

`files:captions:`

List of files that can be used as subtitles, similar to posters, for example:

captions:
  - $1.vtt
  - $1.srt
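As an illustration of how such lists could be resolved, the sketch below applies each pattern to the MP4 file name through the identifier regex, then keeps the first candidate that exists. The function names are hypothetical, and the regex used in the example (everything before ".mp4") is only an assumption for the demonstration.

```javascript
// Builds candidate file names by substituting each pattern ($1, $2, ...)
// into the match of identifierRegex on the MP4 file name.
function candidateFileNames(mp4Name, identifierRegex, patterns) {
  const regex = new RegExp(identifierRegex); // case-sensitive, as documented
  if (!regex.test(mp4Name)) return [];
  // The regex should match the whole file name, otherwise unmatched text remains.
  return patterns.map((pattern) => mp4Name.replace(regex, pattern));
}

// Returns the first candidate present among the existing file names, or null.
function pickFirstExisting(candidates, existingNames) {
  const existing = new Set(existingNames);
  return candidates.find((name) => existing.has(name)) ?? null;
}

const folderContent = ["concert.mp4", "concert.png", "concert.webp", "concert.srt"];
const posters = candidateFileNames("concert.mp4", "^(.*)\\.mp4$", ["$1.jpg", "$1.jpeg", "$1.webp", "$1.png"]);
const captions = candidateFileNames("concert.mp4", "^(.*)\\.mp4$", ["$1.vtt", "$1.srt"]);
console.log(pickFirstExisting(posters, folderContent));  // "concert.webp": listed before png
console.log(pickFirstExisting(captions, folderContent)); // "concert.srt": no vtt file exists
```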

`data:`

`data:file:`

Path to the XLSX file containing the data.

`data:identifierValue:`

Pattern applied on files:identifierRegex: to build the string searched in the identifier column of the XLSX file (see data:in:identifier:).

`data:dbCheck:`

Pattern applied on files:identifierRegex: to check in the database if the video already exists or not, using a SIMILAR TO on the filename field of the videoSource table. If the video is already in Peertube, it will be moved to the doubles folder (see paths).

Example

A file named « quite-long-video-2023-08-31.mp4 » corresponds to the row in the XLSX where the column identifier (defined in data:in:) contains « quite-long-video ». We do not want to upload the video to Peertube if it is found under the name « quite-long-video-2023-08-31.mp4 » or « quite-long-video-2024-02-20.mp4 », or any other date. The same applies to all other videos.

Therefore, the video’s identifier will be built using its filename minus its last 15 characters. This results in « quite-long-video ».
The script will search in the XLSX for the row where the column « identifier » contains this identifier « quite-long-video ».
To check if this video already exists in Peertube, we will use this identifier followed by any 15 characters, as it may exist with different dates.

The setting will be:

files:
  identifierRegex: ^(.*).{15}$
data:
  identifierValue: $1
  dbCheck: $1_{15}

files:identifierRegex: The identifier is defined by capturing all the characters of the filename, except the last 15.
data:identifierValue: The identifier searched for in the XLSX will simply be this identifier.
data:dbCheck: A search is made in the database for a video whose source is this identifier followed by any 15 characters. If it exists, the upload is canceled.

Use Regex101 to test regex substitutions.
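The substitutions from this example can also be checked directly in Node.js, as sketched below. The function names, the query text, and the { text, values } query shape (the one accepted by the node-postgres client) are assumptions; only the table and field names come from the description above. In SIMILAR TO, an underscore matches any single character, so _{15} means "any 15 characters".

```javascript
// Settings from the example above, as they would look once parsed.
const exampleIdentifierSettings = {
  files: { identifierRegex: "^(.*).{15}$" },
  data: { identifierValue: "$1", dbCheck: "$1_{15}" },
};

// Applies both patterns to the file name through the identifier regex.
function buildIdentifiers(fileName, settings) {
  const regex = new RegExp(settings.files.identifierRegex);
  return {
    identifier: fileName.replace(regex, settings.data.identifierValue),
    dbCheckPattern: fileName.replace(regex, settings.data.dbCheck),
  };
}

// Assumed query shape for the existence check (not taken from the PMU source).
function buildDbCheckQuery(dbCheckPattern) {
  return {
    text: 'SELECT 1 FROM "videoSource" WHERE "filename" SIMILAR TO $1 LIMIT 1',
    values: [dbCheckPattern],
  };
}

// JavaScript equivalent of this specific pattern, for local experiments only.
function matchesExampleDbCheck(storedFilename, identifier) {
  const escaped = identifier.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`^${escaped}.{15}$`).test(storedFilename);
}

const built = buildIdentifiers("quite-long-video-2023-08-31.mp4", exampleIdentifierSettings);
console.log(built.identifier);     // quite-long-video
console.log(built.dbCheckPattern); // quite-long-video_{15}
console.log(matchesExampleDbCheck("quite-long-video-2024-02-20.mp4", built.identifier)); // true
```

Note that characters with a special meaning in SIMILAR TO (such as %, _ or parentheses) appearing in a real file name would need escaping before being used this way.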

`data:in:`

This is a list of keywords that identify the role of each column in the XLSX file, from left to right. If a column has no role in the upload tool, use any value that is not a keyword. For example, if the first column contains the identifier of the video, the third its title, and the second is of no particular use, the first element of this list will be identifier, the third title, and the second useless or any other value (even an empty string or one already present elsewhere).

- identifier

Column containing the identifier used to find the video in the XLSX from the file name (see data:identifierValue:).

- title

Column containing the video title.

- description

Column containing the video description.

- channel

Column containing the Peertube channel of the video. This channel must exist before upload; otherwise, the script will move the video to paths:failed: and move to the next one.

- playlists

Column containing the playlist(s) of the channel where the video will be placed (separated by a semicolon). Unlike channels, the script will create missing playlists.

- privacy

Column containing the video visibility. On the XLSX side, it must take one of the values indicated in data:privacies:.

- category

Column containing the video category based on values returned by the API videos/categories.

- language

Column containing the video audio language. On the XLSX side, it must take one of the values indicated in data:languages:.

- tags

Column containing the video keyword(s), separated by a semicolon.

- publicationDate

Column containing the video publication date, in « Y-m-d » format (e.g., « 2023-12-31 ») or just a year (in which case « -01-01 » will be added).

There are also optional keywords to define the columns hosting values returned by the tool. See data:out: below for more information.
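To make the column mapping concrete, here is a minimal sketch that turns one XLSX row (already read as an array of cells) into a video object, following the rules above: columns without a known keyword are skipped, playlists and tags are split on semicolons, and a bare year gets « -01-01 » appended. The names mapRow and normalizePublicationDate are hypothetical, and reading the XLSX file itself is left out.

```javascript
const INPUT_KEYWORDS = new Set([
  "identifier", "title", "description", "channel", "playlists",
  "privacy", "category", "language", "tags", "publicationDate",
]);
const LIST_KEYWORDS = new Set(["playlists", "tags"]);

// Accepts "2023-12-31" as is, and turns a bare year such as 2023 into "2023-01-01".
function normalizePublicationDate(value) {
  const text = String(value).trim();
  return /^\d{4}$/.test(text) ? `${text}-01-01` : text;
}

// columnKeywords is the data:in: list; cells is one row, left to right.
function mapRow(columnKeywords, cells) {
  const video = {};
  columnKeywords.forEach((keyword, index) => {
    if (!INPUT_KEYWORDS.has(keyword)) return; // column with no role in the upload
    const raw = String(cells[index] ?? "").trim();
    if (LIST_KEYWORDS.has(keyword)) {
      video[keyword] = raw.split(";").map((item) => item.trim()).filter(Boolean);
    } else if (keyword === "publicationDate") {
      video[keyword] = normalizePublicationDate(raw);
    } else {
      video[keyword] = raw;
    }
  });
  return video;
}

const dataIn = ["identifier", "useless", "title", "tags", "publicationDate"];
const row = ["quite-long-video", "n/a", "Quite long video", "live; jazz", 2023];
console.log(mapRow(dataIn, row));
```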

`data:out:`

After each successful upload, the tool can write information retrieved from Peertube into the XLSX. Just add one of the keywords below in this section, as well as in data:in: at the position of the corresponding column in the XLSX.

- peertubeUrl

Video URL, e.g., http://127.0.0.1:9001/w/cZyCiabkHDqLrmy6sYmthv.

- peertubeId

Video ID, e.g., 155.

- peertubeShortUuid

Video short UUID, e.g., cZyCiabkHDqLrmy6sYmthv.

- peertubeUuid

Video UUID, e.g., 611e06c9-31a7-44bc-b5e9-604963b56ca9.

- uploadDate

Upload date in the format defined in data:uploadDateFormat:, e.g., 3/25/2024, 4:10:35 PM in « en-US » format or 25/03/2024 16:10:35 in « fr-FR » format.

`data:uploadDateFormat:`

Format used for writing the upload date in the XLSX, e.g., en-US or fr-FR. This is the first argument of Date.prototype.toLocaleString().
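A quick way to preview a given format is to call toLocaleString() directly, as in this small sketch (formatUploadDate is a hypothetical name). The exact separators and spacing depend on the Node.js and ICU versions, so the output may differ slightly from the examples given for uploadDate.

```javascript
// Formats a date with the locale given in data:uploadDateFormat:.
function formatUploadDate(date, uploadDateFormat) {
  return date.toLocaleString(uploadDateFormat);
}

// March 25, 2024, 16:10:35 in the local time zone (months are 0-based).
const uploadedAt = new Date(2024, 2, 25, 16, 10, 35);
console.log(formatUploadDate(uploadedAt, "en-US"));
console.log(formatUploadDate(uploadedAt, "fr-FR"));
```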

`data:languages:`

List of equivalences between languages in Peertube and those in the XLSX, in the format key: value. In the example below, the language column in the XLSX should contain « English » to indicate that the video is in English (see data:in:).

languages:
  en: English
  fr: French

`data:privacies:`

List of equivalences between privacies in Peertube and those in the XLSX. For example:

privacies:
  - Public
  - Unlisted
  - Private
  - Internal
  - Protected
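The two lookups could work along the lines of the sketch below. It rests on two assumptions that are not confirmed by this readme: that labels are compared ignoring case, and that the position of a label in privacies: (starting at 1) is the numeric privacy ID sent to Peertube. Both function names are hypothetical.

```javascript
// Finds the Peertube language code whose XLSX label matches, e.g. "English" -> "en".
function languageCode(label, languages) {
  const wanted = String(label).trim().toLowerCase();
  const entry = Object.entries(languages).find(([, name]) => name.toLowerCase() === wanted);
  return entry ? entry[0] : null;
}

// Assumption: the 1-based position in the list is the Peertube privacy ID.
function privacyId(label, privacies) {
  const wanted = String(label).trim().toLowerCase();
  const index = privacies.findIndex((name) => name.toLowerCase() === wanted);
  return index === -1 ? null : index + 1;
}

const languages = { en: "English", fr: "French" };
const privacies = ["Public", "Unlisted", "Private", "Internal", "Protected"];
console.log(languageCode("English", languages)); // "en"
console.log(privacyId("Unlisted", privacies));   // 2 under the assumption above
```

Returning null for an unknown label lets the caller decide whether to fail the upload or fall back to a default.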

`logs:`

Set each of the 5 levels to true or false depending on whether information related to them needs to be sent to the console, for example:

logs:
  debug: false
  info: true
  warning: true
  error: true
  fatal: true
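A logger honoring these flags can be as small as the sketch below; createLogger and its output format are hypothetical, not the actual PMU implementation.

```javascript
const LOG_LEVELS = ["debug", "info", "warning", "error", "fatal"];

// Creates one method per level; a level prints only if its flag is true.
// `sink` defaults to console.log and can be replaced, e.g. in tests.
function createLogger(logs, sink = console.log) {
  const logger = {};
  for (const level of LOG_LEVELS) {
    logger[level] = (message) => {
      if (logs[level] === true) sink(`[${level.toUpperCase()}] ${message}`);
    };
  }
  return logger;
}

const logger = createLogger({ debug: false, info: true, warning: true, error: true, fatal: true });
logger.debug("not displayed");
logger.info("upload started");
```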

Script launch

Once configuration is done, the script can be launched on the default environment with:

node upload.js | tee -a logs/upload.log

The node upload.js part runs the script, and the optional | tee -a logs/upload.log part appends (flag -a) the information displayed in the console to the logs/upload.log file.

ATTENTION Ensure that the XLSX file is not open before upload as it makes it inaccessible for writing; therefore, the information retrieved from Peertube (URLs, IDs, etc.) cannot be saved there and will be lost.

Optional arguments

--env to set the environment to use.
--limit to process only a certain number of files, overriding the value possibly defined in settings.yaml.

For example, to process only the first 10 files in the prod environment:

node upload.js --limit=10 --env=prod | tee -a logs/uploadProd.log
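For reference, arguments written in the --name=value form shown above can be read with a few lines of code. This is only a sketch with a hypothetical parseCliArgs function; it does not handle the space-separated form (--limit 10).

```javascript
// Parses --env=<name> and --limit=<integer> from a list of arguments.
function parseCliArgs(argv) {
  const args = { env: "default", limit: undefined };
  for (const arg of argv) {
    const match = /^--(env|limit)=(.*)$/.exec(arg);
    if (!match) continue; // ignore anything else
    const [, name, value] = match;
    if (name === "env") {
      args.env = value;
    } else {
      const parsed = Number.parseInt(value, 10);
      if (Number.isNaN(parsed)) throw new Error(`--limit expects an integer, got "${value}"`);
      args.limit = parsed;
    }
  }
  return args;
}

// In a script this would typically be: parseCliArgs(process.argv.slice(2))
console.log(parseCliArgs(["--limit=10", "--env=prod"]));
```

Leaving limit undefined when the argument is absent makes it easy to fall back to the value from settings.yaml.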

Known issues

The XLSX file loses its formatting.
Licenses are not handled.
Uploading a video can sometimes fail with errors like « write EPIPE » or « write ECONNRESET ». Most of the time, this is related to a disk space issue or an overburdened CPU. Typically, simply restarting the tool later, once these resources are available again, will ensure that the same videos are successfully uploaded.
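Restarting by hand could be replaced by an automatic retry with increasing pauses, along the lines of the sketch below. This is not what PMU currently does; the function names, the list of error codes treated as temporary, and the delays are all assumptions. Retrying only helps if the resource actually becomes available again, for example once transcoding has freed disk space.

```javascript
const TRANSIENT_CODES = new Set(["ECONNRESET", "EPIPE", "ENOSPC"]);

// Decides whether an upload error looks temporary, based on its code
// (directly or on error.cause, as with "fetch failed") or on its message.
function isTransientUploadError(error) {
  const code = error?.code ?? error?.cause?.code;
  if (TRANSIENT_CODES.has(code)) return true;
  return /ECONNRESET|EPIPE|ENOSPC|other side closed/.test(String(error?.message ?? ""));
}

// Exponential backoff: 1 min, 2 min, 4 min, ... capped at 1 hour by default.
function backoffDelayMs(attempt, baseMs = 60000, maxMs = 3600000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Calls uploadOnce() until it succeeds, a non-temporary error occurs,
// or maxAttempts is reached. `sleep` can be replaced in tests.
async function uploadWithRetry(
  uploadOnce,
  maxAttempts = 5,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))
) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await uploadOnce();
    } catch (error) {
      if (!isTransientUploadError(error) || attempt + 1 >= maxAttempts) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```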

Suggested improvements

All fields are mandatory in the XLSX file, while some could have default values (e.g., channel or privacy), extracted from the filename (e.g., title), returned by a function (e.g., ‹ now › for publicationDate), or simply ignored.
Other data file formats should be allowed in addition to XLSX, such as CSV or JSON.
The tool retrieves all data from the XLSX as strings, but it should also be able to handle IDs.
Add unit and functional tests.

Chocobozzz · April 29, 2024, 1:37

When it’s out, don’t hesitate to create an MR on the documentation to add your tool to the « Third party applications » page of the PeerTube documentation.

Ash3T · July 11, 2024, 2:59

Hi, just to let you know that I am waiting for your release of PMU. Looking forward to giving it a try.