Digital Worlds – The Blogged Uncourse

Originally published as Digital Worlds – Interactive Media and Game Design, a free learning resource on computer game design, development and culture, this blog was authored as part of an experimental approach to the production of online distance learning materials; many of the resources presented in its first incarnation also found their way into a for-credit, formal education course from the UK’s Open University.

The blog was rebooted at the start of summer 2016 to act as a repository for short pieces relating to mixed and augmented reality, and related areas of media/reality distortion, as preparation for a unit on the subject in a forthcoming first-level Open University course. Since then, it has morphed into a space where I can collect stories and examples of how representations of the physical world can be digitally captured, and how audio, image and video media can in turn be manipulated to produce distorted re-presentations of the world that are perhaps indistinguishable from it…

Interlude – Enter the Land of Drawings…

One of the classic British children’s TV programmes from the 1970s was Simon in the Land of Chalk Drawings, a “meta-animation” in which the lead character, Simon, is able to enter the (animated) land of chalk drawings through his magic chalkboard.

On one reading, we can view the land of chalk drawings as a virtual reality experienced by Simon; on another, we can imagine the chalkboard as a forerunner of an augmented reality colouring book.

“Drawn” and “real” worlds have also been combined in other culturally significant creations, such as the well-known Take on Me music video by the Norwegian 80s pop group a-ha.

ACTIVITY: what other TV programmes or videos do you remember from the past that either hinted at, or might provide inspiration for, augmented or mixed reality effects and applications?

At the time, the Take on Me video was a masterpiece of video compositing. But as photo- and video-manipulation tools develop, and as augmented reality toolkits become ever more widely available, the ability to produce similarly styled videos may become commonplace.

For example, creating “pencil drawn” images from photos can be easily achieved using a range of filters in applications such as Adobe Photoshop:
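Although the Photoshop workflow is point-and-click, a comparable effect can also be scripted. The following is a minimal sketch using the OpenCV library rather than Photoshop, assuming a hypothetical input file input.jpg:

```python
# A minimal "pencil drawn" effect scripted with OpenCV (an illustrative sketch,
# not the Photoshop workflow described above); input.jpg is a hypothetical photo.
import cv2

img = cv2.imread("input.jpg")

# OpenCV's non-photorealistic rendering module produces both a greyscale and a
# colour pencil-sketch rendering of an image.
sketch_gray, sketch_colour = cv2.pencilSketch(
    img, sigma_s=60, sigma_r=0.07, shade_factor=0.05)

cv2.imwrite("sketch_gray.jpg", sketch_gray)
cv2.imwrite("sketch_colour.jpg", sketch_colour)
```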

And the Pencil Sketch tool in Adobe After Effects will apply a similar effect to videos.

The Pencil Sketch effect is applied by processing a video image directly. But a similar sort of end effect can also be created by applying a texture transformation to a motion-captured model.

By manipulating a model, rather than a video frame, we are no longer tied to purely re-presenting the captured video image. Instead, the capture can be manipulated and the performance transformed away from the original motions, as well as the original textures.

Photoshopping Audio…

By now, we’re all familiar with the idea that images can be manipulated – “photoshopped” – to modify a depicted scene in some way (for example, Even if the Camera Never Lies, the Retouched Photo Might…). Vocal recordings can be modified using audio processing techniques such as pitch shifting, and traditional audio editing techniques such as cutting and splicing can be used to edit audio files and create “spoken” sentences that have never been uttered before, simply by reordering separately cut words.
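To make the cut-and-splice and pitch-shifting ideas concrete, here is a rough sketch using the librosa and soundfile libraries; the file name and word timings are hypothetical (in practice the timings would come from listening to the recording, or from a forced aligner):

```python
# A rough sketch of classic "cut and splice" editing and pitch shifting, assuming
# a hypothetical recording narration.wav and hand-identified word timings (seconds).
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("narration.wav", sr=None)

def clip(start, end):
    """Return the samples between two times given in seconds."""
    return y[int(start * sr):int(end * sr)]

# Hypothetical timings for two words in the recording...
word_a = clip(1.2, 1.6)
word_b = clip(3.0, 3.5)

# ...spliced back together in a different order.
sf.write("respliced.wav", np.concatenate([word_b, word_a]), sr)

# Shift the pitch of the whole recording up four semitones.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
sf.write("shifted.wav", shifted, sr)
```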

But what if we could identify both the words spoken by an actor, and model their voice, so that we could edit out their mistakes, or literally put our own words in their mouths, by changing a written text that is then used to generate the soundtrack?

A new technique for editing audio that does exactly that was demonstrated by Adobe in late 2016. An audio track is used to generate both a speech-generating voice model and a text transcript aligned to the speech. The text track can then be edited, not just to rearrange the order of the originally spoken words, but also to insert new words.

Not surprisingly, the technique could raise concerns about the “evidential” quality of recorded speech.

EXERCISE: Read the contemporaneous report of the Adobe VoCo demonstration from the BBC News website, “Adobe Voco ‘Photoshop-for-voice’ causes concern”. What concerns are raised in the report? What other concerns, if any, do you think this sort of technology raises?

The technique was reported in more detail in a SIGGRAPH 2017 paper:

The paper – Zeyu Jin, Gautham J. Mysore, Stephen DiVerdi, Jingwan Lu, and Adam Finkelstein, “VoCo: Text-based Insertion and Replacement in Audio Narration,” ACM Transactions on Graphics 36(4): 96, 13 pages, July 2017 – describes the technique as follows:

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editors own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.

Voice Capture and Modelling

A key part of the Adobe VoCo approach is the creation of a voice model that can be used to generate utterances that sound like the spoken words of the person whose voice has been modelled, a technique we might think of in terms of “voice capture and modelling”. As the algorithms improve, the technique is likely to become more widely available, as suggested by other companies developing demonstrations in this area.

For example, the start-up company Lyrebird has already demonstrated a service that will model a human voice from one minute’s worth of voice capture, and then allow you to generate arbitrary utterances in that voice from typed text.

Read more about Lyrebird in the Scientific American article New AI Tech Can Mimic Any Voice by Bahar Gholipour.

Lip Synching Video – When Did You Say That?

The ability to use captured voice models to generate narrated tracks works fine for radio, but what if you wanted to actually see the actor “speak” those words? By generating a facial model of a speaker, it is possible to use a video representation of an individual as a puppet whose facial movements are acted by someone else, a technique described as facial re-enactment (Thies, Justus, Michael Zollhöfer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt, “Real-time expression transfer for facial reenactment,” ACM Trans. Graph. 34(6), 2015, Article 183).

Facial re-enactment involves morphing features or areas from one face onto corresponding elements of another, and then driving a view of the second face from motion capture of the first.

But what if we could generate a model of the face that allowed facial gestures, such as lip movements, to be captured at the same time as an audio track, and then use the audio (and lip capture) from one recording to “lipsync” the same actor speaking those same words in another setting?

The technique, described in Suwajanakorn, Supasorn, Steven M. Seitz, and Ira Kemelmacher-Shlizerman, “Synthesizing Obama: learning lip sync from audio,” ACM Transactions on Graphics (TOG) 36(4), 2017, Article 95, works as follows: audio and sparse mouth shape features from one video are associated using a neural network. The sparse mouth shape is then used to synthesize a texture for the mouth and lower region of the face that can be blended onto a second, stock video of the same person, with the jaw shapes aligned.
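To give a flavour of the first stage of that pipeline, the sketch below learns a mapping from per-frame audio features to a handful of mouth-shape coefficients. It is only a schematic illustration: it uses a simple ridge regression rather than the recurrent network in the paper, the training file name is hypothetical, and the mouth-shape targets are random stand-ins for coefficients that would really come from tracked landmarks.

```python
# A schematic sketch of learning a mapping from audio features to sparse mouth
# shape coefficients. This is not the paper's recurrent network: it uses a simple
# ridge regression and random stand-in targets, purely to show the shape of the idea.
import numpy as np
import librosa
from sklearn.linear_model import Ridge

y, sr = librosa.load("training_clip.wav", sr=16000)    # hypothetical training audio
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # one row of audio features per frame

# Stand-in targets: a few coefficients describing the mouth shape in each
# corresponding video frame (in practice these come from tracked lip landmarks).
mouth_coeffs = np.random.randn(mfcc.shape[0], 5)

model = Ridge().fit(mfcc, mouth_coeffs)

# Given audio, predict a sparse mouth shape per frame; later stages of the real
# pipeline turn such shapes into a photorealistic mouth texture for compositing.
predicted_shapes = model.predict(mfcc)
```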

For now, the approach is limited to transposing the spoken words from one video recording of a person onto a second video of the same person. As one of the researchers, Steven Seitz, is quoted in Lip-syncing Obama: New tools turn audio clips into realistic video: “[y]ou can’t just take anyone’s voice and turn it into an Obama video. We very consciously decided against going down the path of putting other people’s words into someone’s mouth. We’re simply taking real words that someone spoke and turning them into realistic video of that individual.”

Augmented Reality and Autonomous Vehicles – Enabled by the Same Technologies?

In Introducing Augmented Reality Apparatus – From Victorian Stage Effects to Head-Up Displays, we saw how the Pepper’s Ghost effect could be used to display information in a car using a head-up display projected onto a car windscreen as a driver aid. In this post, we’ll explore the extent to which digital models of the world that may be used to support augmented reality effects may also be used to support other forms of behaviour…

Constructing a 3D model of an object in the world can be achieved by measuring the object directly, or, as we have seen, by measuring the distance to different points on the object from a scanning device and then using these points to construct a model of the surface corresponding to the size and shape of the object. According to IEEE Spectrum’s report describing A Ride In Ford’s Self-Driving Car: “Ford’s little fleet of robocars … stuck to streets mapped to within two centimeters, a bit less than an inch. The car compared that map against real-time data collected from the lidar, the color camera behind the windshield, other cameras pointing to either side, and several radar sets—short range and long—stashed beneath the plastic skin. There are even ultrasound sensors, to help in parking and other up-close work.”

Whilst the domain of autonomous vehicles may seem somewhat distinct from the world of facial capture on the one hand, and augmented reality on the other, autonomous vehicles rely on having a model of the world around them. One of the techniques currently used for detecting the distances to objects surrounding an autonomous vehicle is LIDAR, in which a laser is used to accurately measure the distance to a nearby object. But recognising visual imagery also has an important part to play in the control of autonomous and “AI-enhanced” vehicles.
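The geometry behind a LIDAR scan is straightforward: the range to a surface comes from the time of flight of a laser pulse (distance = speed of light × round-trip time / 2), and each return, paired with the beam angle, becomes a point in space. A minimal sketch with made-up readings:

```python
# Converting LIDAR returns into 2D points around a vehicle (a minimal sketch with
# made-up readings; real sensors also report elevation, intensity and timestamps).
import math

# Hypothetical (beam angle in degrees, range in metres) returns from one sweep.
returns = [(0, 12.4), (30, 8.1), (60, 25.0), (90, 3.2), (120, 7.7)]

points = []
for angle_deg, distance in returns:
    angle = math.radians(angle_deg)
    x = distance * math.cos(angle)   # metres ahead of the sensor
    y = distance * math.sin(angle)   # metres to the side of the sensor
    points.append((round(x, 2), round(y, 2)))

print(points)   # a (very sparse) point cloud describing nearby surfaces
```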

For example, consider the case of automatic lane detection:

Here, an optical view of the world is used as the basis for detecting lanes on a motorway. The video also shows how other vehicles in the scene can be detected and tracked, along with the range to them.
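A classic, if simplistic, recipe for this sort of lane detection is edge detection followed by a Hough transform to pick out straight line segments; the sketch below uses OpenCV on a hypothetical still frame (production systems are considerably more sophisticated):

```python
# A classic (non-production) recipe for picking out lane markings: edge detection
# followed by a probabilistic Hough transform to find straight line segments.
import cv2
import numpy as np

frame = cv2.imread("road_frame.jpg")        # hypothetical still from a dashcam video
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 50, 150)

# On a motorway image, the strongest long straight segments tend to be the lane markings.
lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=50,
                        minLineLength=100, maxLineGap=50)

if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(frame, (x1, y1), (x2, y2), (0, 255, 0), 3)

cv2.imwrite("lanes_overlaid.jpg", frame)
```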

A more recent video from Ford shows the model of the world perceived through the range of sensors on one of their autonomous vehicles.

Part of the challenge of proving autonomous vehicle technologies to regulators, as well as to development engineers, is the ability to demonstrate what the vehicle thinks it can see and what it might do next. To this extent, augmented reality displays may be useful in presenting, in real time, a view of a vehicle’s situational awareness of the environment it currently finds itself in.

DO: See if you can find some further examples of the technologies used to demonstrate the operation of self-driving and autonomous vehicles. To what extent do these look like augmented reality views of the world? What sorts of digital models do the autonomous vehicles create? To what extent could such models be used to support augmented reality effects, and what effects might they be?

If, indeed, there is crossover between the technology stack that underpins autonomous vehicles and the one that underpins augmented reality, then computational devices developed to support autonomous vehicle operation may also be useful to augmented and mixed reality developers.

DO: read through the description of the NVIDIA DRIVE PX 2 system and software development kit. To what extent do the tools and capabilities described sound as if they may be useful as part of an augmented or mixed reality technology stack? See if you can find examples of augmented or mixed reality developers using such toolkits originally developed or marketed for autonomous vehicle use and share them in comments below.

Using Cameras to Capture Objects as Well as Images

In The Photorealistic Effect… we saw how textures from photos could be overlaid onto 3D digital models, and how digital models could be animated by human puppeteers: motion capture is used to track the movement of articulation points on the human actor, and this information is then used to actuate similarly located points on the digital character mesh. In 3D Models from Photos, we saw how textured 3D models could be “extruded” from a single photograph by associating points on the photograph with a mesh and then deforming the mesh in 3D space. In this post, we’ll explore further how the digital models themselves can be captured by scanning actual physical objects, as well as by constructing models from photographic imagery.

We have already seen how markerless motion capture can be used to capture the motion of actors and objects in the real world in real time, and how video compositing techniques can be used to change the pictorial content of a digitally captured visual scene. But we can also use reality capture technologies to scan physical world objects, or otherwise generate three dimensional digital models of them.

Generating 3D Models from Photos

One way of generating a three-dimensional model is to take a base three-dimensional mesh model and map it onto appropriate points in a photograph.

The following example shows an application called Faceworx in which textures from a front facing portrait and a side facing portrait are mapped onto a morphable mesh. The Smoothie-3d application described in 3D Models from Photos uses a related approach.

3D Models from Multiple Photos

Another way in which photographic imagery can be used to generate 3D models is to use techniques from photogrammetry, defined by Wikipedia as “the science of making measurements from photographs, especially for recovering the exact positions of surface points”. Several photographs of the same object are taken, the same features are identified in each of them, and the photographs are then aligned, with the differential distances between features used to model the three-dimensional character of the original object.

DO: read the description of how the PhotoModeler application works: PhotoModeler – how it works. Similar mathematical techniques (triangulation and trilateration) can also be used to calculate distances in a wide variety of other contexts, such as finding the location of a mobile phone based on the signal strengths of three or more cell towers with known locations.
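To make the triangulation/trilateration idea concrete, the sketch below recovers an unknown 2D position from its distances to three known reference points, which is essentially the cell tower calculation mentioned above; the coordinates and distances are made up and assumed noise-free.

```python
# Trilateration sketch: recover an unknown 2D position from its distances to three
# known reference points (made-up, noise-free values; the true position is (3, 4)).
import numpy as np

# Known reference positions (e.g. camera stations, or cell towers).
p1, p2, p3 = np.array([0.0, 0.0]), np.array([10.0, 0.0]), np.array([0.0, 10.0])

# Measured distances from the unknown point to each reference.
d1, d2, d3 = 5.0, 8.0623, 6.7082

# Subtracting the circle equations pairwise leaves two linear equations in (x, y):
#   2(p2 - p1) . (x, y) = d1^2 - d2^2 + |p2|^2 - |p1|^2   (and similarly for p3)
A = 2 * np.array([p2 - p1, p3 - p1])
b = np.array([
    d1**2 - d2**2 + p2.dot(p2) - p1.dot(p1),
    d1**2 - d3**2 + p3.dot(p3) - p1.dot(p1),
])

print(np.linalg.solve(A, b))   # approximately [3. 4.]
```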

“Depth Cameras”

Peripheral devices such as Microsoft Kinect, the Intel RealSense camera and the Structure Sensor 3D scanner perceive depth directly as well as capturing photographic imagery.

In the case of Intel RealSense devices, three separate camera components work together to capture the imagery (a traditional optical camera) and the distance to objects in the field of view (an infra-red camera and a small infra-red laser projector).
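To give a flavour of how a developer reads depth from such a device, the sketch below follows the pattern of Intel’s own pyrealsense2 examples; it assumes a RealSense camera is attached and simply reports the distance to whatever sits at the centre of the frame.

```python
# Reading a distance from an Intel RealSense depth camera, following the pattern
# of Intel's pyrealsense2 examples (assumes a RealSense device is connected).
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

try:
    frames = pipeline.wait_for_frames()
    depth_frame = frames.get_depth_frame()
    # Distance, in metres, to whatever sits at the centre pixel of the depth image.
    print("Centre of frame:", depth_frame.get_distance(320, 240), "metres")
finally:
    pipeline.stop()
```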

With their ability to capture distance-to-object measures as well as imagery, depth perceiving cameras represent an enabling technology that opens up a range of possibilities for application developers. For example, itseez3d is a tablet based application that works with the Structure Sensor to provide a simple 3D scanner application that can capture a 3D scan of a physical object as both a digital model and a corresponding texture.

Depth-Perceiving Cameras and Markerless Mocap

Depth perceiving cameras can also be used to capture facial models, as the FaceShift markerless motion capture studio shows.

Activity: according to the FAQ for the FaceShift Studio application shown in the video below, what cameras can be used to provide inputs to the FaceShift application?

Exercise: try to find one or two recent examples of augmented or mixed reality applications that make use of depth sensitive cameras and share links to them in the comments below. To what extent do the examples require the availability of the depth information in order for them to work?

Interactive Dynamic Video

Another approach to using video captures to create interactive models is a technique developed by researchers Abe Davis, Justin G. Chen and Fredo Durand at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), referred to as interactive dynamic video. In this technique, a few seconds (or minutes) of video are analysed to study the way a foreground object vibrates, either naturally or when gently perturbed.

Rather than extracting a three-dimensional model of the perturbed object and then rendering that as a digital object, the object in the interactive video is perturbed by constructing a “pyramid” mesh over the pixels of the video image itself (Davis, A., Chen, J.G. and Durand, F., 2015, “Image-space modal bases for plausible manipulation of objects in video,” ACM Transactions on Graphics (TOG), 34(6), p.239). That is, there is no “freestanding” 3D model of the object that can be perturbed. Instead, the object exists as a dynamic, interactive model within the visual scene in which it is situated. (For a full list of related papers, see the Interactive Dynamic Video website.)

SAQ: to what extent, if any, is interactive dynamic video an example of an augmented reality technique? Explain your reasoning.

Adding this technique to our toolbox, along with the ability to generate simple videos from still photographs as described in Hyper-reality Offline – Creating Videos from Photos, we see how it is increasingly possible to bring imagery alive simply through the manipulation of pixels, mapped as textures onto underlying structural meshes.

Interlude – Ginger Facial Rigging Model

Applications such as Faceshift, as mentioned in The Photorealistic Effect…, demonstrate how face meshes can be captured from human actors and used to animate digital heads.

Ginger is a browser-based facial rigging demo, originally from 2011 but since updated, that allows you to control the movements of a digital head.

If you enable the follow feature, the eyes and head will follow the motion of your mouse cursor around the screen. The demo is listed on the Google Chrome Experiments website and can be found here: https://sv-ginger.appspot.com. (The code, which builds on the three.js 3D JavaScript library, is available on Github: StickmanVentures/ginger.)

Recap – Enabling the Impossible

One of the recurring themes in this series of posts has been the extent to which particular augmented or mixed reality effects are impossible to achieve without the prior development of one or more enabling technologies.

The following video clip from Cinefix describing “The Top 10 VFX Innovations in the 21st Century” demonstrates how visual effects in blockbuster movies have evolved over several years as new techniques are invented, developed and then combined in new ways.

Here’s a quick breakdown of the top 10.

  • digital color-grading: recoloring films automatically to influence the mood of the film (a toy code sketch of this idea appears after the list);
  • fluid modelling/water effects: bulk volume mesh vs. droplet (particle-by-particle) models, combined in hybrid simulations;
  • AI powered crowd animation: individuals have their own characters and actions that are then played out;
  • motion capture as a basis for photo-realistic animation;
  • universal capture/markerless performance capture;
  • painted face marker capture;
  • digital backlot;
  • imocap – in-camera motion capture – motion capture data captured alongside principal photography;
  • intermeshing of 3D digital backlot, live capture and live rendering, virtual reality camera;
  • lightbox cage rig, compositing of human actor and digital world.
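Picking up the first item in that list, digital colour grading at its simplest is just a per-channel remapping of pixel values. The toy sketch below, using OpenCV and a hypothetical frame file, nudges an image towards the familiar “teal and orange” look by cooling the shadows and warming the highlights; it is an illustration of the idea, not a production grading pipeline.

```python
# A toy "colour grade": lift the blues in the shadows and the reds in the
# highlights to push an image towards a teal-and-orange look.
import cv2
import numpy as np

img = cv2.imread("frame.jpg").astype(np.float32) / 255.0   # hypothetical film frame (BGR)

b, g, r = cv2.split(img)
luminance = 0.114 * b + 0.587 * g + 0.299 * r

# Blend towards blue where the frame is dark, towards red where it is bright.
b = np.clip(b + 0.15 * (1.0 - luminance), 0, 1)
r = np.clip(r + 0.15 * luminance, 0, 1)

graded = cv2.merge([b, g, r])
cv2.imwrite("graded.jpg", (graded * 255).astype(np.uint8))
```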

DO: watch the video clip, noting what technologies were developed in order to achieve the effect or how pre-existing technologies were combined in novel ways to achieve the effect. To what extent might such technologies be used in a realtime mixed or augmented reality setting and for what purpose? What technical challenges would need to be overcome in order to use the techniques in such a way?

 

The Photorealistic Effect…

In Even if the Camera Never Lies, the Retouched Photo Might… we saw how photographic images may be manipulated using digital tools to create “hyperreal” imagery in which perceived “imperfections” in the real world artefact are removed. In this post, we’ll explore how digital tools can be used to create imagery that looks like a photograph but was created entirely digitally, from the mind of the artist.

As an artistic style, photorealism refers to artworks in which the artist uses a medium other than photography to try to create a representation of a scene that looks as if it was captured as a photograph using a camera. By extension, photorealism aims to (re)create something that looks like a photograph and in so doing capture a lifelike representation of the scene, whether the scene is imagined or a depiction of an actual physical reality.

DO: Look through the blog posts Portraits Of The 21st Century: The Most Photorealistic 3D Renderings Of Human Beings (originally posted as an imgur collection shared by Reddit user republicrats) and 15 CGI Artworks That Look Like Photographs. How many of the images included in those posts might you mistake for a real photograph?

According to digital artist and self-proclaimed “BlenderGuru” Andrew Price in his hour long video tutorial Photorealism Explained, which describes some of the principles and tools that can be used in making photorealistic CGI (computer generated imagery), there are four pillars to creating a photorealistic image – modelling, materials, lighting, post-processing:

  • photorealistic modelling – “matching the proportions and form of the real world object”;
  • photorealistic materials – “matching the shading and textures of real world materials”;
  • photorealistic lighting – “matching the color, direction and intensity of light seen in real life”;
  • photorealistic post-processing – “recreating imperfections from real life cameras”.

Photorealistic modelling refers to the creation of a digital model that is then textured and lit to create the digital image. Using techniques that will be familiar to 3D game developers, 3D mesh models may be constructed from scratch using open-source tools such as Blender or professional commercial tools.

Blender – Modeling a Human Head Basemesh (YouTube)

The mesh based models can also be transformed in a similar way to the manipulation of 2D photos mapped onto the nodes of a 2D mesh.

Underpinning the model may be a mesh containing many thousands of nodes encompassing thousands of polygons. Manipulating the nodes allows the model to be fully animated in a realistic way.

Once the model has been created, the next step is to apply textures to it. The textures may be created from scratch by the artist, or based on captures from the real world.

In fact, captures provide another way of creating digital models, by seeding them with data points captured from a high resolution scan of a real world model. In the following clip about the development of the digital actor “Digital Emily” (2008), we see how 3D scanning can be used to capture a face pulling multiple expressions, and from these scans construct a mesh with overlaid textures grabbed from the real world photographs as the basis of the model.

Watch the full video – ReForm | Hollywood’s Digital Clones – for a more detailed discussion about “digital actors”. Among other things, the video describes the Lightstage X technology used to digitise human faces. Along with “Digital Emily”, the video introduces “Digital Ira”, from 2012. Whereas Emily took 30 minutes to render each frame, Ira could be rendered at 30fps (30 renders per second).

Price’s third pillar refers to lighting. Lighting effects are typically based on computationally expensive algorithms, incorporated into the digital artist’s toolchain using professional tools such as Keyshot as well as forming part of more general toolsuites such as Blender. The development of GPUs – graphical processing units – capable of doing the mathematical calculations required in parallel and ever more quickly is one of the reasons why Digital Ira is a far more responsive actor than Digital Emily could be.

The following video reviews some of the techniques used to render photorealistic computer generated imagery.

Finally, we come to Price’s fourth pillar – post-processing – things like motion blur, glare/lens flare and depth of field effects, where the camera can only focus on items a particular distance away and everything else is out of focus. In other words, all the bits that are “wrong” with a photographic image. (A good example of this can be found in the blog post This Image Shows How Camera Lenses Beautify or Uglify Your Pretty Face, which shows the same portrait photograph taken using various different lenses; /via @CharlesArthur.) In professional photography, the photographer may use tools such as Photoshop to create images that are physically impossible to capture using a camera because of the camera’s physical properties; photo-manipulation is then used to create hyper-real images, closely based on reality but representing a fine tuning of it. According to Price, to create photorealistic images using tools that produce perfect depictions of a well-textured and well-lit accurate model in a modelled environment, we need to add back in the imperfections that the camera, at least, introduces into the captured scene. To imitate reality, it seems we need to model not just the (imagined) reality of the scene we want to depict, but also the reality of the device we claim to be capturing the depiction with.
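As a crude illustration of what “adding the imperfections back in” can mean in practice, the sketch below applies a directional motion blur and a lens-style vignette to a hypothetical rendered frame using OpenCV; real post-processing pipelines are, of course, far more sophisticated.

```python
# Adding "camera imperfections" back into a clean rendered frame: a directional
# motion blur plus a vignette (a crude sketch of the idea, not a production pipeline).
import cv2
import numpy as np

render = cv2.imread("clean_render.jpg").astype(np.float32)   # hypothetical CGI frame
h, w = render.shape[:2]

# Horizontal motion blur: average each pixel with its neighbours along one direction.
ksize = 15
kernel = np.zeros((ksize, ksize), np.float32)
kernel[ksize // 2, :] = 1.0 / ksize
blurred = cv2.filter2D(render, -1, kernel)

# Vignette: darken the frame towards the corners, as a real lens tends to.
ys, xs = np.mgrid[0:h, 0:w]
dist = np.sqrt((xs - w / 2) ** 2 + (ys - h / 2) ** 2)
vignette = 1.0 - 0.5 * (dist / dist.max()) ** 2
result = blurred * vignette[..., np.newaxis]

cv2.imwrite("post_processed.jpg", np.clip(result, 0, 255).astype(np.uint8))
```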

VideoRealistic Motion

In addition to the four pillars of photorealism described by Andrew Price when considering photorealistic still imagery, we might add another pillar for photorealistic moving pictures (maybe we should call this videorealistic motion!):

  • photorealistic motion – matching the way things move and react in real life.

When used as the basis of an animated (video) scene, a question arises as to how to actually animate the head in a realistic way. Where the aim is to recreate human-like expressions or movements, the answer may simply be to use a person as a puppeteer, with motion capture recording the actor’s facial expressions and using them to actuate the digital model. Such puppetry is now a commodity application, as the Faceshift markerless motion capture facial animation software demonstrates. (See From Motion Capture to Performance Capture – Sampling Movement in the Real World into the Digital Space for more discussion about motion capture.)

With Hollywood film-makers regularly using virtual actors in their films, the next question to ask is whether such renderings will be possible in a “live” augmented reality context: will it be possible to sit a virtual Emily in your postulated Ikea sitting room and have her talk through the design options with you?

The following clip, which combines many of the techniques we have already seen, uses a 3D registration image within a physical environment as the location point for a digital actor animated using motion capture from a human actor.

In the same way that digital backlots now provide compelling visual recreations of background – as well as foreground – scenery, as we saw in Mediating the Background and the Foreground, it seems that now even the reality of the human actors may be subject to debate. By the end of the clip, I am left with the impression that I have no idea what’s real and what isn’t any more! But does this matter at all? If we can create photorealistic digital actors and digital backlots, does it change our relationship to the real world in any meaningful way? Or does it start to threaten our relationship with reality?

Interlude – AR Apps Lite – Faceswapping

In the post From Magic Lenses to Magic Mirrors and Back Again we reviewed several consumer-facing alternate reality phone applications, such as virtual make-up apps. In this post, we’ll review some simple face-based reality distorting effects with an alternative reality twist.

In the world of social networks, Snapchat provides a network for sharing “disposable” photographs and video clips, social objects that persist on the phone for a short period before disappearing. One popular feature of Snapchat comes in the form of its camera and video filters, also referred to as Snapchat Lenses, which can be used to transform or overlay pictures of faces in all sorts of unbecoming ways.

As the video shows, the lenses allow digital imagery to be overlaid on top of the image, although the origin of the designs is sometimes open to debate as the intellectual property associated with face-painting designs becomes contested (for example, Swiped – Is Snapchat stealing filters from makeup artists?).

Behind the scenes, facial features are captured using a crude form of markerless facial motion capture to create a mesh that acts as a basis for the transformations or overlays as described in From Motion Capture to Performance Capture and 3D Models from Imagery.
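The sort of sparse facial mesh that sits behind such lenses can be produced with openly available tools. The sketch below uses dlib’s pre-trained 68-point landmark model (the same library used in the face-swapping article discussed below); the image file name is hypothetical and the .dat model file has to be downloaded separately.

```python
# Detecting a sparse facial "mesh" (68 landmark points) in a single image using
# dlib's pre-trained shape predictor. The .dat model file must be downloaded
# separately; the filename below is the one dlib distributes.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("selfie.jpg")               # hypothetical input photo
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    landmarks = predictor(gray, face)
    # Draw each of the 68 tracked points; lenses warp or overlay graphics
    # relative to points like these.
    for i in range(68):
        p = landmarks.part(i)
        cv2.circle(img, (p.x, p.y), 2, (0, 255, 0), -1)

cv2.imwrite("landmarks.jpg", img)
```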

Another class of effect supported by “faceswap” style applications is an actual faceswap, in which one person’s face is swapped with another’s – or even your own.

Indeed, New York songwriter Anthony D’Amato went one step further, using the app to swap his face with those of various celebrities to make a faceswapped video of him singing one of his own songs (/via Digital Trends, World’s first FaceSwap music video is equal parts creepy, impressive).

As well as swapping two human faces, faceswapping can be used to swap a human face with the face of a computer game character. For computer gamers wanting to play a participating role in the games they are playing, features such as EA Sports GameFace allow users to upload two photos of their face – a front view and a side view – and then use their face on one of the game’s character models.

The GameFace interface requires the user to manually map various facial features on the uploaded photographs so that these can then be used to map the facial mesh onto an animated character mesh. The following article shows how facial features registered as a simple mesh on two photographs can be used to achieve a faceswap effect “from scratch” using open source programming tools.

DO: read through the article Switching Eds: Face swapping with Python, dlib, and OpenCV by Matthew Earl to see how a faceswap style effect can be achieved from scratch using some openly available programming libraries. What process is used to capture the facial features used to map from one face to the other? How is the transformation of swapping one face with another actually achieved? What role does colour manipulation play in creating a realistic faceswap effect?

If you would like to try to replicate Earl’s approach, his code is available on Github at matthewearl/faceswap. (A quick search of Github also turns up some other approaches, such as zed41/faceSwapPython and MarekKowalski/FaceSwap.)
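The heart of Earl’s approach is working out how to rotate, scale and translate one set of landmarks so that it sits over the other (an ordinary Procrustes alignment). The sketch below is a compact, hedged reconstruction of that single step, assuming two (N, 2) NumPy arrays of landmark coordinates such as those produced by dlib above.

```python
# Estimating the rotation, scale and translation that maps one set of facial
# landmarks onto another (ordinary Procrustes analysis), the alignment step at the
# heart of landmark-based face swapping. Assumes two (N, 2) coordinate arrays.
import numpy as np

def align_landmarks(points_from, points_to):
    """Return a function mapping points_from-space coordinates onto points_to."""
    p = points_from.astype(np.float64)
    q = points_to.astype(np.float64)
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    p_c, q_c = p - p_mean, q - q_mean
    p_std, q_std = p_c.std(), q_c.std()

    # The best-fitting rotation comes from the SVD of the correlation matrix
    # between the two centred, scale-normalised point sets.
    u, _, vt = np.linalg.svd((p_c / p_std).T @ (q_c / q_std))
    rotation = (u @ vt).T
    scale = q_std / p_std

    def transform(points):
        return scale * (points - p_mean) @ rotation.T + q_mean

    return transform
```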

Developing algorithms and approaches for face tracking is an active area of research, both in academia and commercially. The outputs of academic research are often written up in academic publications. Sometimes the implementation code is made available by the researchers, although at other times it is not. Academic reports should also provide enough detail about the algorithms described for independent third parties to be able to implement them, as is the case with Audun Mathias Øygard’s clmtrackr.

DO: What academic paper provided the inspiration for clmtrackr? Try running examples listed on auduno/clmtrackr and read about the techniques used in the posts Fitting faces – an explanation of clmtrackr and Twisting faces: some use cases of clmtrackr. How does the style of writing and explanation in those posts compare to the style of writing used in the academic paper? What are the pros and cons of each style of writing? Who might the intended audience be in each case?

UPDATE: it seems as if Snapchat may be doing a line of camera-enabled sunglasses – Snapchat launches sunglasses with camera. How much harder is it to imagine the same company doing a line in novelty AR specs that morph those around you in a humorous and amusing way whenever you look at them…?! Think: the X-Ray Spex ads from the back of old comics…

From Motion Capture to Performance Capture – Sampling Movement in the Real World into the Digital Space

In Augmented TV Sports Coverage & Live TV Graphics and From Sports Tracking to Surveillance Tracking…, we started to see how objects in the real world could be tracked and highlighted as part of a live sports TV broadcast. In this post, we’ll see how the movement of objects tracked in the real world, including articulated objects such as people, can be sampled into a digital representation that effectively allows us to transform them into digital objects that can be used to augment the original scene.

Motion capture and, more recently, performance capture techniques have been used for several years by the film and computer games industries to capture human movements and use them to animate what amounts to a virtual puppet that can then be skinned as required within an animated scene. Typically, this would occur in post-production, where any latency associated with registering and tracking the actor, or animating and rendering the final scene, could largely be ignored.

However, as motion and performance capture systems have improved, so too has the responsiveness of these systems, allowing them to be used to produce live “rushes” of the captured performance with a rendered virtual scene. But let’s step back in time a little and look back at the origins of motion capture.

Motion capture – or mo-cap – refers to digitally capturing the movement of actors or objects for the purposes of animating digital movements. Markers placed on the actor or object allow the object to be tracked and its trajectory recorded. Associating points on a digital object with the recorded points allows the trajectory to be replayed out by the digital object. Motion capture extends the idea of tracking a single marker that might be used to locate a digital object in an augmented reality setting by tracking multiple markers with a known relationship to each other, such as different points on the body of a particular human actor.
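At its simplest, “replaying” captured motion means feeding the recorded marker positions, frame by frame, to the corresponding points or joints on a digital model. The toy sketch below, using made-up 2D data, derives an elbow angle from three tracked markers of the kind that could then drive the elbow joint of a rigged character.

```python
# A toy example of turning tracked marker positions into a joint angle that can
# drive a digital character's rig: the elbow angle from shoulder, elbow and
# wrist markers, frame by frame (2D, made-up data).
import numpy as np

def joint_angle(a, b, c):
    """Angle at b (degrees) formed by markers a-b-c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos_angle = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Hypothetical captured frames: (shoulder, elbow, wrist) positions.
frames = [
    ((0, 0), (0.3, -0.1), (0.6, 0.0)),
    ((0, 0), (0.3, -0.1), (0.5, 0.2)),
    ((0, 0), (0.3, -0.1), (0.4, 0.3)),
]

for shoulder, elbow, wrist in frames:
    # In a real pipeline this angle would be applied to the elbow joint of the rig.
    print(round(joint_angle(shoulder, elbow, wrist), 1))
```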

An example of how motion capture techniques are used to animate the movement of objects, rather than actors, is provided by the BLACKBIRD adjustable electric car rig. This rig provides a customisable chassis – the length of the vehicle can be modified and the suspension is fully adjustable – that can be used to capture vehicle movements, and footage from within the vehicle. Markers placed on the rig are tracked in the normal way and a digital body shell is then overlaid on the tracked registration points. The adaptable size of the rig allows marked points on differently sized vehicles to be accurately tracked. According to its designers, The Mill, an augmented reality application further “allows you to see the intended vehicle in CG, tracked live over the rig on location”.

Motion capture is a relatively low resolution, low fidelity technique that captures tens of points which can be used to animate a relatively large mass, such as a human character. However, whereas markers on human torsos and limbs have a relatively limited range of free movement, animating facial expressions is far more complex, not least because the human brain is finely tuned to reading expressions on very expressive human faces. Which is where performance capture comes in.

Performance capture blends motion capture at a relatively low resolution (typically, the orientation and relative placement of markers placed around limb joints) with more densely placed markers on the face. Facial markers are tracked using a head-mounted camera, along with any vocal performance provided by the actor.

Performance capture allows the facial performance of human actors to drive the facial performance of a digital character. By recording the vocal performance alongside the facial performance, “lip synch” between the voice and mouth movements of the character can be preserved.

As real-time image processing techniques have developed, markerless performance capture systems now exist, particularly for facial motion capture, that do not require any markers to be placed on the actor’s face.

In the case of facial markerless motion capture, multiple facial features are detected automatically and used to implicitly capture the motion of those features relative to each other.

As well as real time mocap and markerless performance capture, realtime previews of the digitally rendered backlot are also possible. Andy Serkis’ tour of his Imaginarium performance capture studio for Empire magazine demonstrates this to full effect.

Virtual cameras are described in more detail in the following clip.

SAQ: What is a virtual camera? To what extent do virtual cameras provide an augmented or mixed reality view of the world?

Originally developed as techniques for animating movements and facial performances in games or films that were then rendered as part of a time-consuming post-production process, the technology has developed to such an extent that motion and performance capture now allow objects and actors to be tracked in realtime. Captured data points can be used to animate the behaviour of digital actors, on digital backlots, providing a preview, in real time, of what the finally rendered scene might actually look like.

For the actor in such a performance space, there is an element of make-believe about the setting and the form of the other actors they are performing with – the actors can’t actually see the world they are supposed to be inhabiting, although the virtual cameraman, and director, can. Instead, the actors perform in what is effectively a featureless space.

For the making of the film Gravity, a new rig known as the Light Box was developed that presented the actors with a view of the digitally created world they were to be rendered in, as a side effect of lighting the actors in such a way that it looked as if the light was coming from the photorealistic, digital environment they would be composited with.

SAQ: how might performance capture and augmented reality be used as part of a live theatrical experience? What challenges would such a performance present? Feel free to let your imagination run away with you!

Answer: As Andy Serkis’ Imaginarium demonstrates, facilities already exist where photorealistic digital worlds populated by real world characters can be rendered in real time so the director can get a feel for how the finished scene will look as it is being shot. However, the digital sets and animated characters are only observable to third parties, rather than the actual participants in the scene, and then only from the perspective of a virtual camera. But what would it take to provide an audience with a realtime rendered view of an Imaginarium-styled theatre set? For this to happen at a personal level would require multiple camera views, one for each seat in the audience, the computational power to render the scene for each member of the audience from their point of view, and personal, see-through augmented reality displays for each audience member.

Slightly simpler might be personally rendered views of the scene for each of the actors so that they themselves could see the digital world they were inhabiting, from their own perspective. As virtual reality goggles would be likely to get in the way of facial motion capture, augmented reality displays capable of painting the digital scene from the actor’s perspective in real time would be required. For film-makers, though, the main question to ask would then be: what would such immersion mean to the actors in terms of their performance? And it’s hard to see what the benefit might be for the audience.

But perhaps there is a middle ground that would work? For example, the use of projection-based augmented reality might be able to render a digital backlot, at least for a limited field of view. Many stage magicians create illusions that only work from a particular perspective, although this limits the audience size. Another approach might be to use a Pepper’s Ghost style effect, or even to hide the cast on stage behind an opaque projection screen and play out their digitally rendered performance on the screen. Live animated theatre, or a digital puppet show. A bit like the Gorillaz…

Motion and performance capture are now an established part of film making, at least for big budget film producers, with digital rushes of digital backlots and digital characters previewed in real time alongside the actors’ performances. It will be interesting to see the extent to which similar techniques might be used as part of live performance in front of a live audience.

3D Models from Photos

In Hyper-reality Offline – Creating Videos from Photos we saw how a parallax-style effect could be used to generate a 3D effect from a single, static photograph, and in Even if the Camera Never Lies, the Retouched Photo Might… we saw how the puppet warp tool could be used to manipulate photographic meshes to reshape photographed items within a digital photograph. But can we go further than that, and generate an actual 3D model from a photo?

In 2007, a Carnegie Mellon research project released an application called Fotowoosh that allowed users to generate a three-dimensional model from a single photograph:

Another approach to generating three-dimensional perspectives that appeared in 2007 built up a three-dimensional model by stitching together multiple photographs of the same scene taken from different viewpoints. The original Photosynth application from Microsoft allowed users to navigate through a series of overlaid but distinct photographs in order to explore the compound scene.

The current version, which can be found at photosynth.net, generates a photorealistic 3D model that can be smoothly navigated through. Here’s an example:

https://photosynth.net/preview/embed/632632d3-5f19-4379-98d9-2644c7aee3f7?delayload=true&autoplay=true&showannotations=true&showpromo=true

karen long neck hill tribe by sonataroundtheworld on photosynth

Photosynth is very impressive, but it is more concerned with generating navigable 3D panoramas from multiple photographs than with constructing digital models from photographs – models that can then be manipulated as animated digital objects.

In 2013, researchers from Tsinghua University and Tel Aviv University demonstrated a more comprehensive modelling tool, 3-Sweep, for generating 3D models from a single photo.

The Fotowoosh website has long since disappeared, and the 3-Sweep software is no longer available, but other applications in a similar vein have come along to replace them. For example, Smoothie-3D allows you to upload a photo and apply a mesh to it that can then be morphed into a 3D model.

So why not grab a coffee, find an appropriate photo, and see if you can create your own 3D model from it.

