Pitfalls to Avoid When Building Your Own Video Sequencer on iOS
By Anton Kormakov, an iOS Developer at Rosberry
Hi! My name is Anton, and I'm an iOS developer at Rosberry.
Not long ago I worked on a project called Hype Type, where I had to solve several interesting problems involving video, text, and animation. In this article I'd like to walk through the pitfalls of building a real-time video sequencer on iOS, and how to avoid them.
A little bit about the app itself
Hype Type lets a user record a set of short videos and/or several photos with a total running time of up to 15 seconds, add text to the resulting clip, and apply one of the available animations to it.
The key requirement for video here is that the user must be able to manage each clip independently of the others: change its playback speed, reverse it, flip it, and (perhaps in future versions) reorder clips on the fly.
You may ask: "Why not use AVMutableComposition?" In most cases you would be right: it is a reasonably handy video sequencer. Alas, it has constraints that ruled it out for us. First, tracks cannot be changed or added on the fly: to get an updated video stream you have to recreate the AVPlayerItem and reinitialize the AVPlayer. Second, working with still images in AVMutableComposition is not flawless either: to put a static image on the timeline you have to use AVVideoCompositionCoreAnimationTool, which adds a great deal of overhead and drastically slows down rendering.
A short web search did not turn up any other reasonably suitable solutions, so we decided to build our own video sequencer.
Let’s get the ball rolling
To start, a little about the structure of the project's rendering pipeline. I won't go into much detail here, assuming you are more or less familiar with the subject; otherwise this article would grow out of all proportion. If you are a newcomer, take a closer look at the well-known GPUImage framework (Obj-C, Swift): it is a great starting point for getting a handle on OpenGL ES, with clear examples.
The view responsible for on-screen rendering of the recorded video requests frames from the sequencer on a timer (CADisplayLink). Since the app works mostly with video, it makes sense to use the YCbCr color space and pass each frame around as a CVPixelBufferRef. For each fetched frame, luminance and chrominance textures are created and fed to a shader program, which outputs an RGB image that is presented to the user. The refresh loop looks something like this:
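A minimal sketch of that loop in Swift (the `FrameSource` protocol, the `PlaybackView` name, and the `draw(yCbCrBuffer:)` stub are illustrative stand-ins for the project's own types, not its actual code):

```swift
import UIKit
import CoreVideo

// Illustrative stand-in for the sequencer's frame-vending interface.
protocol FrameSource: AnyObject {
    func pixelBuffer(at timestamp: CFTimeInterval) -> CVPixelBuffer?
}

final class PlaybackView: UIView {
    var source: FrameSource?
    private var displayLink: CADisplayLink?

    func startPlayback() {
        let link = CADisplayLink(target: self, selector: #selector(render(_:)))
        link.add(to: .main, forMode: .common)
        displayLink = link
    }

    func stopPlayback() {
        displayLink?.invalidate()
        displayLink = nil
    }

    @objc private func render(_ link: CADisplayLink) {
        // Ask the sequencer for the frame matching the upcoming display time.
        guard let buffer = source?.pixelBuffer(at: link.targetTimestamp) else { return }
        draw(yCbCrBuffer: buffer)
    }

    private func draw(yCbCrBuffer: CVPixelBuffer) {
        // In the real project: create luminance and chrominance textures from
        // the buffer via CVOpenGLESTextureCache and run the shader program.
    }
}
```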
Almost everything here is built on wrappers (around CVPixelBufferRef, CVOpenGLESTexture, etc.). This moves the low-level logic into a separate layer and substantially simplifies routine OpenGL work. It has drawbacks, of course (mainly a slight loss of performance and flexibility), but they are not critical. To clarify: self.context is a thin wrapper around EAGLContext that simplifies working with CVOpenGLESTextureCache and multithreaded OpenGL calls, and self.source is the sequencer that decides which frame from which track is handed to the view.
Now a few words about how frame fetching is organized. Since the sequencer has to work with both video and images, it is logical to hide them behind a common protocol. The sequencer's job is then to control the playhead and, depending on its position, hand out the next frame from the relevant track.
The logic of obtaining frames lives in objects implementing MovieSourceProtocol. This scheme keeps the system generic and extensible, since the only difference between image and video handling is how the frames are obtained.
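The article doesn't show the protocol itself, but a plausible shape could look like this (member names are my guesses, not the project's real API):

```swift
import CoreMedia
import CoreVideo

// A plausible shape for MovieSourceProtocol; member names are illustrative.
protocol MovieSourceProtocol: AnyObject {
    /// Total duration of this source's content.
    var duration: CMTime { get }

    /// The frame for a source-local time, or nil if it is not ready yet.
    func pixelBuffer(at time: CMTime) -> CVPixelBuffer?

    /// Spin up decoders, caches, and so on before the playhead arrives.
    func prepareForPlayback()

    /// Tear everything down to free the decoder slot.
    func finishPlayback()
}
```

An image source and a video source both conform to this, so the sequencer never needs to know which kind of track it is talking to.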
This keeps our VideoSequencer really simple: its main job is to identify the current track and bring all tracks to a common frame rate.
VideoSequencerTrack here is a wrapper around an object implementing MovieSourceProtocol, plus assorted metadata.
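Put together, the track and sequencer might look roughly like this. This is a sketch, not the app's actual code, and it assumes MovieSourceProtocol exposes a `duration` and a `pixelBuffer(at:)` accessor (illustrative names):

```swift
import CoreMedia
import CoreVideo

// Wrapper around a frame source plus per-track metadata.
final class VideoSequencerTrack {
    let source: MovieSourceProtocol
    var speed: Double = 1.0
    var isReversed = false

    var duration: CMTime { source.duration }

    init(source: MovieSourceProtocol) {
        self.source = source
    }
}

final class VideoSequencer {
    private(set) var tracks: [VideoSequencerTrack] = []

    func add(_ track: VideoSequencerTrack) {
        tracks.append(track)
    }

    /// Maps the global playhead position onto a track and a track-local time.
    func frame(at playhead: CMTime) -> CVPixelBuffer? {
        var trackStart = CMTime.zero
        for track in tracks {
            let trackEnd = CMTimeAdd(trackStart, track.duration)
            if CMTimeCompare(playhead, trackEnd) < 0 {
                let localTime = CMTimeSubtract(playhead, trackStart)
                return track.source.pixelBuffer(at: localTime)
            }
            trackStart = trackEnd
        }
        return nil // playhead ran past the last track
    }
}
```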
Working with still images
Now let's move on to fetching frames, starting with the simpler case: still images. An image can come either from the camera, in which case we immediately get a CVPixelBufferRef in YCbCr format, and it is enough to copy it (I'll explain later why this matters) and return it on request; or from the photo library, in which case we have to jump through some hoops and convert it to the needed format manually. The RGB-to-YCbCr conversion could be done on the GPU, but the CPU of a modern device copes with it quickly enough, especially considering that the app additionally crops and compresses the image before use. The rest is simple: all the source has to do is return the very same frame for a given period of time.
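A still-image source then reduces to caching one buffer and handing it back on every request. A sketch under those assumptions (the `deepCopy(of:)` helper is hypothetical, standing in for a manual plane-by-plane buffer copy):

```swift
import CoreMedia
import CoreVideo

// Hypothetical helper: plane-by-plane CVPixelBuffer copy (implementation
// elided here; see the pitfalls section for why copying matters).
func deepCopy(of buffer: CVPixelBuffer) -> CVPixelBuffer? {
    return nil // placeholder in this sketch
}

// Sketch of a still-image source: one YCbCr buffer, returned unchanged
// for every request within the slide's duration.
final class ImageSource: MovieSourceProtocol {
    let duration: CMTime
    private let buffer: CVPixelBuffer

    init?(cameraBuffer: CVPixelBuffer, duration: CMTime) {
        // Copy right away so the original buffer can be released back
        // to the capture pipeline.
        guard let copy = deepCopy(of: cameraBuffer) else { return nil }
        self.buffer = copy
        self.duration = duration
    }

    func pixelBuffer(at time: CMTime) -> CVPixelBuffer? {
        // Always the same frame, regardless of `time`.
        return buffer
    }

    func prepareForPlayback() {} // nothing to do: the frame is always ready
    func finishPlayback() {}
}
```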
Working with video
Now let's add video. We decided to build on AVPlayer, mainly because it offers an easy API for getting frames and takes care of the sound. That sounds simple enough, but there are a few points that deserve closer attention.
Let’s start with some obvious stuff:
We create an AVURLAsset, load the track information, create an AVPlayerItem, wait for the notification that it is ready for playback, and create an AVPlayerItemVideoOutput with parameters suitable for rendering. So far, so simple.
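In Swift the setup reads roughly like this (a sketch: `videoURL` and the surrounding `player` are assumed to exist, and error handling and the KVO boilerplate are trimmed):

```swift
import AVFoundation

func prepareItem(for videoURL: URL, in player: AVPlayer) {
    let asset = AVURLAsset(url: videoURL)
    // Load track info asynchronously before building the player item.
    asset.loadValuesAsynchronously(forKeys: ["tracks", "duration"]) {
        let item = AVPlayerItem(asset: asset)

        // Ask for YCbCr buffers that can go straight into the GL pipeline.
        let attributes: [String: Any] = [
            kCVPixelBufferPixelFormatTypeKey as String:
                kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
        ]
        let output = AVPlayerItemVideoOutput(pixelBufferAttributes: attributes)
        item.add(output)

        // Observe item.status; once it reaches .readyToPlay, frames can be
        // pulled with output.copyPixelBuffer(forItemTime:itemTimeForDisplay:).
        player.replaceCurrentItem(with: item)
    }
}
```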
However, here comes the first issue: seekToTime does not work fast enough, and there are noticeable delays when looping. Leaving the toleranceBefore and toleranceAfter parameters at their defaults hardly changes anything, except that the delay is now complemented by positioning inaccuracy. This is a system limitation that cannot be removed entirely, but it can be worked around: prepare two AVPlayerItems and use them in turn. When one finishes playing, the other starts, while the first is rewound to the beginning, and so the playback loops.
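One way to sketch that ping-pong scheme is with an AVQueuePlayer holding two items over the same asset (illustrative code; the real implementation also keeps each item's video output attached):

```swift
import AVFoundation

// Sketch of gapless looping with two AVPlayerItems: while one copy plays,
// the other is already rewound to zero and queued up behind it.
final class LoopingPlayer {
    let player = AVQueuePlayer()

    init(asset: AVAsset) {
        // Two independent items over the same asset.
        player.insert(AVPlayerItem(asset: asset), after: nil)
        player.insert(AVPlayerItem(asset: asset), after: nil)
        NotificationCenter.default.addObserver(
            self,
            selector: #selector(itemDidFinish(_:)),
            name: .AVPlayerItemDidPlayToEndTime,
            object: nil
        )
    }

    @objc private func itemDidFinish(_ note: Notification) {
        guard let finished = note.object as? AVPlayerItem else { return }
        // The queue player has already moved on to the second copy; rewind
        // the finished one and put it back at the end of the queue.
        finished.seek(to: .zero) { [weak self] _ in
            guard let self = self,
                  let last = self.player.items().last, finished !== last,
                  self.player.canInsert(finished, after: last) else { return }
            self.player.insert(finished, after: last)
        }
    }
}
```

(On iOS 10 and later, AVPlayerLooper packages up essentially the same idea, but rolling it by hand keeps control over the items' video outputs.)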
The second issue is also unpleasant but solvable: AVFoundation does not support seamless, smooth playback-speed changes and reverse for all file formats. When recording from the camera we can control the format, but when a user imports a video from the photo library we cannot. Making the user wait for the video to be converted is a bad option, all the more so because they may never touch those settings. So we decided to convert in the background and quietly replace the original video with the converted one.
MovieProcessor here is a service that pulls frames and audio samples from a reader and feeds them to a writer. (It can also process the frames from the reader on the GPU, but that is only used when the whole project is exported, to overlay the animation frames onto the final video.)
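The core of such a service is an AVAssetReader feeding an AVAssetWriter. A trimmed-down sketch, video track only, with error handling elided (the function name and settings are my own, not the app's MovieProcessor):

```swift
import AVFoundation

// Pulls decoded frames from the reader and re-encodes them with the writer.
// Speed changes or reversal would be applied between the two steps.
func transcode(asset: AVAsset, to url: URL,
               completion: @escaping () -> Void) throws {
    let reader = try AVAssetReader(asset: asset)
    let writer = try AVAssetWriter(outputURL: url, fileType: .mp4)

    let videoTrack = asset.tracks(withMediaType: .video)[0]
    let output = AVAssetReaderTrackOutput(track: videoTrack, outputSettings: [
        kCVPixelBufferPixelFormatTypeKey as String:
            kCVPixelFormatType_420YpCbCr8BiPlanarVideoRange
    ])
    reader.add(output)

    let input = AVAssetWriterInput(mediaType: .video, outputSettings: [
        AVVideoCodecKey: AVVideoCodecType.h264,
        AVVideoWidthKey: videoTrack.naturalSize.width,
        AVVideoHeightKey: videoTrack.naturalSize.height
    ])
    writer.add(input)

    reader.startReading()
    writer.startWriting()
    writer.startSession(atSourceTime: .zero)

    let queue = DispatchQueue(label: "movie.processor")
    input.requestMediaDataWhenReady(on: queue) {
        while input.isReadyForMoreMediaData {
            guard let sample = output.copyNextSampleBuffer() else {
                input.markAsFinished()
                writer.finishWriting(completionHandler: completion)
                return
            }
            if !input.append(sample) { return } // writer hit an error
        }
    }
}
```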
Now for something a bit trickier
If each clip is prepared for playback only at the moment it is needed, the delays become too noticeable. Preparing all clips in advance is not possible either, because iOS limits the number of H.264 decoders that can run simultaneously. The way out is pretty simple: prepare the couple of tracks that will be played next, while "wiping out" the tracks that will not be needed in the near future.
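The sliding-window idea can be sketched like this (assuming the hypothetical `prepareForPlayback`/`finishPlayback` pair on the track sources; the window size of one neighbor on each side is also just an illustration):

```swift
// Keep live decoders only for the current track and its neighbors,
// treating the track list as circular so the loop point is prefetched too.
func updatePreparedTracks(currentIndex: Int,
                          tracks: [VideoSequencerTrack],
                          window: Int = 1) {
    guard !tracks.isEmpty else { return }
    for (index, track) in tracks.enumerated() {
        let distance = abs(index - currentIndex)
        let circularDistance = min(distance, tracks.count - distance)
        if circularDistance <= window {
            track.source.prepareForPlayback() // will play soon: spin it up
        } else {
            track.source.finishPlayback()     // free the decoder slot
        }
    }
}
```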
In this simple way we got continuous playback as well as looping. True, scrubbing still causes a certain lag, but it is not critical.
In conclusion, a few words about the pitfalls you might face solving similar problems.
First: if you work with pixel buffers received from the device camera, either release them as soon as possible or copy them if you need them later. Otherwise the video stream will freeze. I have not found this limitation described in the documentation, but it seems the system tracks the pixel buffers it hands out and will not produce new ones while the old ones are still held in memory.
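A plane-by-plane copy of a camera buffer might look like this sketch (bi-planar YCbCr assumed, which is what the camera delivers in this pipeline; extra buffer attributes such as IOSurface backing are omitted):

```swift
import Foundation
import CoreVideo

// Deep-copies a planar CVPixelBuffer so the original can be released
// back to the capture pipeline immediately.
func copyPixelBuffer(_ source: CVPixelBuffer) -> CVPixelBuffer? {
    var copyOut: CVPixelBuffer?
    CVPixelBufferCreate(kCFAllocatorDefault,
                        CVPixelBufferGetWidth(source),
                        CVPixelBufferGetHeight(source),
                        CVPixelBufferGetPixelFormatType(source),
                        nil,
                        &copyOut)
    guard let copy = copyOut else { return nil }

    CVPixelBufferLockBaseAddress(source, .readOnly)
    CVPixelBufferLockBaseAddress(copy, [])
    defer {
        CVPixelBufferUnlockBaseAddress(copy, [])
        CVPixelBufferUnlockBaseAddress(source, .readOnly)
    }

    // Copy every plane row by row: the destination's bytes-per-row
    // may differ from the source's.
    for plane in 0..<CVPixelBufferGetPlaneCount(source) {
        guard let src = CVPixelBufferGetBaseAddressOfPlane(source, plane),
              let dst = CVPixelBufferGetBaseAddressOfPlane(copy, plane)
        else { continue }
        let height = CVPixelBufferGetHeightOfPlane(source, plane)
        let srcStride = CVPixelBufferGetBytesPerRowOfPlane(source, plane)
        let dstStride = CVPixelBufferGetBytesPerRowOfPlane(copy, plane)
        let rowBytes = min(srcStride, dstStride)
        for row in 0..<height {
            memcpy(dst + row * dstStride, src + row * srcStride, rowBytes)
        }
    }
    return copy
}
```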
Second: multithreading with OpenGL. OpenGL itself is no great friend of multithreading, but you can get around that with several EAGLContexts that share a single EAGLSharegroup. This lets you cleanly and quickly separate the rendering of what the user actually sees on screen from background work such as video processing and export.
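The setup is just a couple of lines; a sketch of the idea (queue name and the work inside it are placeholders):

```swift
import OpenGLES

// On-screen context and a background context in the same sharegroup:
// textures and buffers created in one are visible to the other.
let renderContext = EAGLContext(api: .openGLES2)!
let workerContext = EAGLContext(api: .openGLES2,
                                sharegroup: renderContext.sharegroup)!

let workerQueue = DispatchQueue(label: "video.processing")
workerQueue.async {
    // Each thread gets its own current context.
    EAGLContext.setCurrent(workerContext)
    // ... heavy work: decoding, texture uploads, offscreen rendering ...
    glFlush() // make the results visible to the on-screen context
}
```

The key constraint is that each context is made current on exactly one thread at a time; the sharegroup only shares the objects, not the contexts themselves.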