WebRTC Media Streams

14 Apr 2020 - tsp

WebRTC is currently a really hot topic in the web development and JavaScript community. It’s a set of protocols and APIs that can be used for real-time communication from within the typical browser environment. It provides APIs that allow media capture into media streams (from screen sharing, webcams and other similar sources) as well as APIs and protocols that allow for peer-to-peer communication from within the browser. Strictly speaking these two features are not required to be used together. One can use the media capture APIs to simply gather data inside the browser, do some image processing (like implementing some object recognition using tensorflow.js) or perform any other manipulation on the captured data. On the other hand the streaming APIs can be used for peer-to-peer communication between browsers and provide a more flexible way of communicating than classical WebSockets, which only allow communicating with a webserver. They can be used for audio and video streams, file transfers and any other data transfer between clients. To perform P2P connections the framework also uses Interactive Connectivity Establishment (ICE) as specified by RFC 8445. This allows clients behind network address translation to build up P2P connections in many cases (at least via UDP; there exists a similar technique, STUNT, for TCP, but that would require root privileges on network interfaces to inject fake packets into non-established TCP connections).

The following article provides a summary on how to capture media, the data transfer will be handled in a later blog post.

Streaming media

First off there are two different media device classes known to WebRTC:

- User media: webcams, microphones and similar input devices, requested via navigator.mediaDevices.getUserMedia
- Display media: screen sharing of a whole screen or a single window, requested via navigator.mediaDevices.getDisplayMedia

Requesting a media device is always an asynchronous action. The user will always be asked - for webcams on the first visit of the application, for desktop sharing every time - which device to use or which window or screen should be shared. As usual for asynchronous operations these APIs return a promise that can be used to determine when the action has completed. If one wants to code in a synchronous style one has, as usual, to use an async function and can then await the promise. If one wants to perform actions in a more event driven way one can use then().

The following example is going to illustrate both ways of querying a shared screen or a webcam and assign them as a local video source. The video target is simply the well known HTML5 video element:

Note: Currently browsers require webpages that use these APIs to be served from origins they consider secure. That means that the page cannot be opened from a local file and cannot be served through plain HTTP - with the exception of localhost, which browsers treat as a secure context.

<!DOCTYPE html>
<html>
	<head>
		<title> Simple capture example </title>
	</head>
	<body>
		<h1 id="top">Simple capture example</h1>
		<video id="videoTarget" autoplay playsinline></video>
	</body>
</html>

Now one can simply request the media device for screen sharing and, after successful completion, assign it to the video element. Note that controls is a boolean HTML attribute, so it is simply omitted above - writing controls="false" would still enable the controls:

async function screenCaptureExample() {
	if(!navigator.mediaDevices) {
		console.log("Media devices API not available");
		return;
	}

	let targetElement = document.getElementById('videoTarget');
	try {
		// Throws if the user denies permission or cancels the picker
		let mediaStream = await navigator.mediaDevices.getDisplayMedia();
		targetElement.srcObject = mediaStream;
	} catch(err) {
		console.log("Capture example failed: " + err);
	}
}

Alternatively one can use the event driven approach:

function screenCaptureEventExample() {
	if(!navigator.mediaDevices) {
		console.log("Media devices API not available");
		return;
	}

	// getDisplayMedia returns a promise, not the stream itself
	let mediaStreamPromise = navigator.mediaDevices.getDisplayMedia();
	mediaStreamPromise.then((stream) => {
		let targetElement = document.getElementById('videoTarget');
		targetElement.srcObject = stream;
	}).catch((err) => {
		console.log("Capture example failed: " + err);
	});
}

The same method can be used to query user media (for example webcams):

async function userMediaCaptureExample() {
	if(!navigator.mediaDevices) {
		console.log("Media devices API not available");
		return;
	}

	let targetElement = document.getElementById('videoTarget');
	try {
		let mediaStream = await navigator.mediaDevices.getUserMedia({ audio : false, video : true });
		targetElement.srcObject = mediaStream;
	} catch(err) {
		console.log("Capture example failed: " + err);
	}
}

function userMediaCaptureEventExample() {
	if(!navigator.mediaDevices) {
		console.log("Media devices API not available");
		return;
	}

	let mediaStreamPromise = navigator.mediaDevices.getUserMedia({ audio : false, video : true });
	mediaStreamPromise.then((stream) => {
		let targetElement = document.getElementById('videoTarget');
		targetElement.srcObject = stream;
	}).catch((err) => {
		console.log("Capture example failed: " + err);
	});
}

As one can see one has to specify constraints for the user media source (i.e. whether it should provide audio and/or video). It’s possible to specify additional constraints like a minimum, ideal or maximum resolution for video:

{
	video : {
		width : { min: 320, ideal : 640, max : 1280 },
		height: { min : 240, ideal : 480, max : 720 }
	}
}

Note that there are additional constraints (for example to select a specific device via its deviceId). These won’t be shown in this short blog post.
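As a small sketch, such a constraints object can also be built programmatically before being passed to getUserMedia. The helper name below is ours, not part of any API:

```javascript
// Hypothetical helper (not part of WebRTC): build a video constraints
// object of the shape shown above for a requested ideal resolution.
function buildVideoConstraints(idealWidth, idealHeight) {
	return {
		video : {
			width : { min : 320, ideal : idealWidth,  max : 1280 },
			height: { min : 240, ideal : idealHeight, max : 720 }
		}
	};
}

// Usage (browser only):
// navigator.mediaDevices.getUserMedia(buildVideoConstraints(640, 480))
//     .then((stream) => { /* assign to a video element, etc. */ });
```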

Taking snapshots

Taking snapshots of video frames into a canvas is pretty simple. Just create a 2D context and use drawImage to transfer image data from the video frame into the canvas. This works because an HTMLVideoElement is a valid image source for drawImage.

To demonstrate this we just add a button and a canvas to the page. The canvas gets an explicit size matching the snapshot; otherwise it would default to 300x150 pixels and clip the frame:

<button id="takeSnapshot"> Take snapshot </button>
<canvas id="snapshotCanvas" width="320" height="240"> </canvas>
window.onload = function() {
	// ... old code ...

	document.getElementById('takeSnapshot').addEventListener('click', () => {
		takeSnapshotToCanvas();
	});
}

function takeSnapshotToCanvas() {
	let canvas = document.getElementById("snapshotCanvas");
	let ctx = canvas.getContext('2d');
	let videoElement = document.getElementById('videoTarget');

	ctx.drawImage(videoElement, 0, 0, 320, 240);
}

Doing simple manipulation directly inside the browser

Now one can do simple image manipulation inside the canvas as usual. For example we can implement a simple greyscale filter, again just to show how to access pixels. Note that this is by far not the most performant way to do such processing (in fact it’s about the slowest possible one - one shouldn’t use it in production, but it’s nice for experimentation).

Again we’ll add another button to the page and attach an event handler inside the onload function:

<button id="takeSnapshotGreyscale"> Take snapshot in greyscale </button>

window.onload = function() {
	// ... old code ...

	document.getElementById('takeSnapshotGreyscale').addEventListener('click', () => {
		takeSnapshotToCanvas();
		filterCanvas();
	});
}

function filterCanvas() {
	let canvas = document.getElementById("snapshotCanvas");
	let ctx = canvas.getContext('2d');
	let imgData = ctx.getImageData(0, 0, ctx.canvas.width, ctx.canvas.height);
	let pixelData = imgData.data;

	for(let i = 0; i < pixelData.length; i += 4) {
		let intensity = Math.round(pixelData[i] * 0.299 + pixelData[i+1] * 0.587 + pixelData[i+2] * 0.114);
		pixelData[i] = intensity;
		pixelData[i+1] = intensity;
		pixelData[i+2] = intensity;
	}

	ctx.putImageData(imgData, 0, 0);
}
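The per-pixel loop can also be factored out into a standalone function operating on any RGBA byte array - a sketch (the function name is ours), which makes the conversion easy to test outside the browser:

```javascript
// Convert an RGBA pixel buffer (as returned by getImageData().data)
// to greyscale in place, using the same Rec. 601 luma weights as above.
function greyscaleInPlace(pixelData) {
	for(let i = 0; i < pixelData.length; i += 4) {
		let intensity = Math.round(pixelData[i] * 0.299 + pixelData[i+1] * 0.587 + pixelData[i+2] * 0.114);
		pixelData[i]   = intensity;
		pixelData[i+1] = intensity;
		pixelData[i+2] = intensity;
		// The alpha channel (pixelData[i+3]) is left untouched
	}
	return pixelData;
}
```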

Converting a captured image into something that can be passed around

The easiest way is to use a data URL - in this case the whole image content will be Base64 encoded in the requested format so it can be passed around or stored:

	let canvas = document.getElementById("snapshotCanvas");
	let dataUri = canvas.toDataURL("image/png");
	// now dataUri can be passed to img.src or via any available data channel

There is a simple hack of assigning the data URI to the href attribute of an anchor element (<a>), setting its download attribute and programmatically triggering a click on it. This triggers the default download action of the browser for the image.
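A sketch of this trick (the helper names are ours; the anchor never has to be inserted into the page):

```javascript
// Hypothetical helper: map a MIME type as passed to canvas.toDataURL()
// to a matching file extension.
function extensionForMime(mime) {
	switch(mime) {
		case "image/png":  return ".png";
		case "image/jpeg": return ".jpg";
		case "image/webp": return ".webp";
		default:           return "";
	}
}

// Browser-only part: offer the current canvas content as a file download.
function downloadCanvas(canvas, baseName, mime) {
	let anchor = document.createElement('a');
	anchor.href = canvas.toDataURL(mime);
	anchor.download = baseName + extensionForMime(mime);
	anchor.click(); // Triggers the browser's default download action
}

// Usage (browser only):
// downloadCanvas(document.getElementById('snapshotCanvas'), "snapshot", "image/png");
```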

How to perform realtime manipulation

As one has seen above it’s pretty simple to access pixels from a canvas. The filter and snapshot functions can of course be applied on a continuous basis. For example one can use the requestAnimationFrame function to periodically fetch frames from the video element and transfer them into a canvas. This canvas doesn’t have to be visible on the page - it can be kept as a plain JavaScript object. The same is true for the video element:

	var videoElement = document.createElement('video');
	videoElement.srcObject = mediaStream;
	videoElement.play(); // Without starting playback no frames get decoded

	var processingBuffer = document.createElement('canvas');
	processingBuffer.width = 320;
	processingBuffer.height = 240;

	function processFrame() {
		let ctx = processingBuffer.getContext('2d');
		ctx.drawImage(videoElement, 0, 0, 320, 240);

		// Add additional manipulation code here

		window.requestAnimationFrame(processFrame);
	}

	processFrame();

Now that’s of course only half of the story since the slow pixel-by-pixel access stays the same, which is not really wise from the standpoint of performance. One should use - if possible - the power of the GPU to perform calculations. And that’s exactly what one can do using WebGL shaders. The trick is to pass the original video or canvas content as a texture into the GLSL pipeline and do the processing inside the fragment shader, by providing a pretty simple vertex model (for example a quad consisting of two triangles) onto which the original image is mapped as a texture (data access into the texture happens using texture2D inside the fragment shader).
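To make the idea concrete, here is a minimal sketch of such a pipeline (all names are ours, error handling is omitted). The fragment shader applies the same Rec. 601 greyscale weights as the canvas filter above:

```javascript
// GLSL sources for a textured full-screen quad (WebGL 1 / GLSL ES 1.00)
const vertexShaderSource = `
	attribute vec2 position;
	varying vec2 texCoord;
	void main() {
		// Map clip space [-1,1] into texture space [0,1]
		texCoord = position * 0.5 + 0.5;
		gl_Position = vec4(position, 0.0, 1.0);
	}
`;
const fragmentShaderSource = `
	precision mediump float;
	uniform sampler2D frame;
	varying vec2 texCoord;
	void main() {
		vec4 px = texture2D(frame, texCoord);
		float intensity = dot(px.rgb, vec3(0.299, 0.587, 0.114));
		gl_FragColor = vec4(vec3(intensity), 1.0);
	}
`;

// Browser-only sketch: compile the shaders, upload the current video (or
// canvas) frame as a texture and render the two-triangle quad.
function drawGreyscaleFrame(gl, source) {
	function compile(type, src) {
		let shader = gl.createShader(type);
		gl.shaderSource(shader, src);
		gl.compileShader(shader);
		return shader;
	}
	let program = gl.createProgram();
	gl.attachShader(program, compile(gl.VERTEX_SHADER, vertexShaderSource));
	gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fragmentShaderSource));
	gl.linkProgram(program);
	gl.useProgram(program);

	// Two triangles covering the whole viewport
	let quad = new Float32Array([-1,-1, 1,-1, -1,1, -1,1, 1,-1, 1,1]);
	gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
	gl.bufferData(gl.ARRAY_BUFFER, quad, gl.STATIC_DRAW);
	let positionLoc = gl.getAttribLocation(program, "position");
	gl.enableVertexAttribArray(positionLoc);
	gl.vertexAttribPointer(positionLoc, 2, gl.FLOAT, false, 0, 0);

	// Upload the frame; video and canvas elements are valid texture sources
	gl.bindTexture(gl.TEXTURE_2D, gl.createTexture());
	gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.LINEAR);
	gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
	gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
	gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, source);

	gl.drawArrays(gl.TRIANGLES, 0, 6);
}
```

In a real application one would compile the program once and only re-upload the texture per frame instead of recreating everything on every call.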

This article might be expanded to contain a sample of such processing soon.

This article is tagged: Programming, Computer Vision


Dipl.-Ing. Thomas Spielauer, Wien (webcomplains389t48957@tspi.at)

