Research and Development

This is the Research & Development blog, where researchers, scientists and engineers from ±«Óătv R&D share their work in developing the media technologies of the future.

Mitigating LLM hallucinations in text summarisation
By Henry Cooke, 26 June 2024 (/rd/blog/2024-06-mitigating-llm-hallucinations-in-text-summarisation)

I'm excited! After weeks of trying to get an LLM to output reliable, useful summaries of research interview transcripts, I've hit on something that works. This is a post about the journey to get here, and some details on the process I've arrived at. If you don't want any of the preamble or context, you can skip straight to "The Pipeline" below for specifics on the method I've patched together.

One of the main ways we do foresight in ±«Óătv Research & Development's Advisory team is by interviewing experts and doing thematic analysis on those interviews. It's a good way of distilling insight and consensus, but it takes forever. You need to do a close read of every interview, picking out key quotes and identifying themes. You'll typically do many passes of this over the course of a project as your themes and understanding of the material develop. One of my colleagues, while we were deep in this analysis on a project last year, asked me: "couldn't we just get an AI to do this?" This was an interesting question. Whatever you think about the current wave of AI hype, it's undeniable that one thing LLMs are good at is finding patterns in swathes of language data. It's their fundamental function. So on the surface of it, analysing a bunch of interview transcripts (not even generating anything new!) seems like an ideal task for an LLM.

At the time, I wasn't keen on adding a learning curve to the big pile of work we already had to do. However, over the last couple of months I've had a bit of time to look into the question and do some tinkering with a locally-hosted model.

Saying this loud for the folks at the back

I need to make it absolutely clear here that I am not suggesting that human researchers can, or should, be replaced by LLMs and Python scripts. There's no way we should be directly relying on the outputs of LLMs in a research context. As my esteemed colleague Bill Thompson likes to remind us: LLMs hallucinate all the time, it's just that sometimes the hallucinations are useful. They make stuff up, and they miss subtle, but key, details.

Nor am I suggesting that we should be completely handing off the part of thematic analysis where we go over the source material to an LLM. It's an important step which cements the material in our minds and gets us thinking about the emerging themes and how they all fit together.

What I am hoping to do is generate sufficiently useful interview summaries to jump-start the analysis process with a good-enough first pass. Many people, me included, find it much easier to approach a writing task if we have something, anything, to edit.

While I'm disclaimering: I am in no way claiming to be an expert on AI or LLMs. I'm an experienced programmer, and I've got many years' experience with various bits and pieces of language processing, but I'll level with you: whenever I try to read any documentation or papers about machine learning stuff, it's more or less incomprehensible to me. I'm indebted to the many people who have put time and effort into making this stuff usable by non-experts[1], and those who take the time to post tips and tutorials online.

Drawing a circle in chalk

We knew from the start that we couldn’t run this experiment in ChatGPT or Copilot. The ±«Óătv, like many other large organisations, is taking a cautious approach to generative AI and LLMs, and we have clear editorial guidelines and guidance on the use of AI that would rule this out.

So we made a few operational decisions to minimise risk and cleared the proposal with colleagues looking after responsible AI policies, data protection and infosec.

First, we anonymise the interview transcripts before sending them to the model for analysis, so it never has data about who was speaking. Second, we run the LLMs on our own hardware, on ±«Óătv premises[2]. If I wanted to, I could get on a train to London and prod the server that we’re using with a screwdriver.[8]

This means that we’re in total control of how, when and where the models run, and no sensitive data leaves the ±«Óătv estate for processing. Let us not forget that the cloud is, after all, just someone else's computer.

Third, we don’t use the outputs from the LLM directly. They are used strictly as supporting material during a research process; they’re checked by human researchers against interview transcripts (and recordings, if necessary), and they won’t be part of the (human) written outputs of our research projects. With all that in mind, I spun up an LLM instance on an R&D server (via a tool which made this a pleasingly simple task), and set about writing some (Python) code to throw our interview transcripts at the LLM.

It's kind of fascinating working with LLMs at this level. On the one hand, it's absolutely wild to me that I can get a computer to do a decent job of summarising some text literally by telling it, in English: "your task is to summarise a transcript of a research interview."

On the other hand, the technology is so new, and fast-moving, and (most importantly) black-boxed, that we're very much in the "feeling our way around" stage on the path to usefulness - some pivotal techniques I'm about to outline were gathered from various discussion threads where people share phrases and formulae they've arrived at essentially by empirical reasoning. Not for nothing has it been alleged that "."[3]

The last 10% is 90% of the effort

It’s not been a straightforward ride: the strategy I'm describing in this post was pretty much a Hail Mary before giving up entirely. The problem is hallucination: sometimes, LLMs just make up a bunch of stuff instead of sticking to the script. This is by design. An LLM has no actual knowledge, or reasoning capabilities as such: it's generating plausible output, given some opening text. It has no concept of 'truth' or 'accuracy' - it's been said that LLMs are "".[4]

In practice, this meant that I fairly quickly got to a process which was reliably generating plausible-looking interview summaries, but those summaries would often include things the interviewees never said. Not ideal, from a research point of view. The hard bit turned out to be minimising those hallucinations.

So while it's still fresh in my mind, I'm going to outline the recipe I've arrived at to get consistently useful, accurate interview summaries. It's a pipeline that consists of an LLM, some prompt engineering, and a Python script. I have another, higher-level post brewing on the overall workflow for thematic analysis I've been working on with an LLM in the loop. This post is the detail about a process I've cooked up for minimising LLM hallucinations while summarising text - one component of that overall workflow.

The pipeline

Prompt stuffing

This literally means 'stuffing' (or inserting) additional context into the prompt you send to the LLM. Prompt stuffing is a basic building block of 'retrieval augmented generation', or RAG - a general approach to prompting LLMs which revolves around giving the model the extra context it needs to answer specialist requests, steering it towards using that context rather than making stuff up. In this case, the extra context is a chunk of interview transcript. I read about the technique while looking into RAG approaches, and the only trick here is stuffing the context into the prompt fairly near the beginning, as discussed in the next section.
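
To make that concrete, here's a minimal sketch of prompt stuffing in Python. The template is a cut-down version of the full prompt shown later in this post; the function name and exact wording are mine, not the production code.

```python
# Minimal prompt-stuffing sketch: the transcript chunk is inserted near the
# start of the prompt, with the detailed instructions coming afterwards.
SUMMARISE_TEMPLATE = """You are an expert qualitative researcher skilled in thematic analysis.
Your task is to summarise a transcript of a research interview.

### Interview Transcript ###:
{docs}

### Instructions ###:
1. Summarise the main topics and findings in the transcript above.
"""

def build_prompt(transcript_chunk: str) -> str:
    """Stuff a chunk of interview transcript into the summarisation prompt."""
    return SUMMARISE_TEMPLATE.format(docs=transcript_chunk)
```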

Keep it simple

Even with models that apparently have a large "context window" (basically, maximum input length), I've noticed that I get essentially garbage back from the LLM when I feed it too much text. It feels like challenging its 'attention span' makes it 'lose interest' and fall back on generating an answer from what it already knows - the language already present in the model.

So my script breaks down the transcripts into shorter chunks[9] using before combining them with the summarisation prompt and sending it off to the LLM. I also keep the prompt as pithy and unambiguous as I can, and stuff the interview chunk into it early, giving more detailed instructions after the interview text. This seems to do a better job of keeping the LLM on-task.
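
The splitting library itself isn't the interesting part, so here's a hand-rolled sketch of the idea (the chunk size and overlap below are illustrative, not the values we actually use):

```python
def chunk_transcript(text: str, max_words: int = 1500, overlap: int = 100) -> list[str]:
    """Split a transcript into overlapping word-count chunks to keep each prompt short."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```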

In this respect, I've come to think of LLMs as being capable research assistants who are very easily distracted, and design my prompts accordingly.[5] Of course, it's equally likely that there's something wrong with the way I'm running the model, or the setup on the server, that's causing the model's attention to drift. To reiterate: I'm not an expert. One to investigate later.

Explicitly instruct the LLM to stick to the provided material

I got this idea from a Reddit thread, where one user suggests using this phrasing to help mitigate hallucinations when using RAG.

Use “what you know” and only “What you know” to respond to the user.
...
“What you know” = {text}

I used this to focus the LLM's attention on the interview transcript stuffed into the prompt.

Ask the LLM to check its own work

This is also from the above mentioned Reddit thread, where user Jdonavan reports using this phrasing to keep GPT on-track:

Internally generate three possible answer, then evaluate each for accuracy against the context

I have no idea if it's actually generating possible answers and evaluating them internally, but adding an edict along these lines does a pretty good job of preventing the LLM from inventing quotes that weren't in the original text.

The prompt

Putting that all together, the prompt I'm currently using looks like this:

 <s> [INST] You are an expert qualitative researcher skilled in thematic analysis. Your task is to summarise a transcript of a research interview.

Use “Interview Transcript” and only “Interview Transcript” to summarise the interview. Do not embellish or add detail to “Interview Transcript”.

### Interview Transcript ###:
{docs}

### Instructions ###:
1. Aim for a summary of around 400-500 words.
2. Your summary must include the following:
2a. The main topics and subjects covered.
2b. 5-7 most important findings and insights present in “Interview Transcript”.
2b1. Check these findings against “Interview Transcript” and reject any that do not have any supporting quotes in the text.
2c. 1-2 illustrative quotes for each key finding. 
2c1. Check these quotes against “Interview Transcript” and reject any that do not appear in the text.
2c2. Quotes must be transcribed exactly as they appear in “Interview Transcript”. Do not edit or summarise quotes.
2d. Any other relevant context or inferences you can make based on the interview content
2e. 1-2 surprising or unusual observations.
3. Format your summary in clear paragraphs with headings for each section. Use Markdown format.
4. Interview transcripts have been anonymised. Do not name the interviewee.

[/INST]
Summary: </s>

There are a few other ideas in here that I haven't mentioned elsewhere in the post: using ### to break the prompt up into sections, and the <s> and [INST] tags that mark out instructions for the model.

Checking the LLM's work

The final step that weeds out hallucinations is much dumber. My Python script finds quotes in a generated summary and compares them against the original text. If there are any words present in quotes from the summary that aren't in the original text (with a little margin for rephrasing or mistranscription)[6], we take that as an indicator that the LLM has been making stuff up and generate the summary again.

Because LLM output is non-deterministic, chances are that a freshly generated summary will be closer to the mark - the script also generates a new random seed and gradually increases the temperature for each retry as a way to get more diverse output.

'Dumber' means reliable here though - by relying on a process which has very clear and repeatable criteria for a binary success or failure, we avoid the ambiguities inherent in a stochastic system.
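
For the curious, a minimal sketch of that check-and-retry loop is below. The `generate` argument stands in for whatever function actually sends the prompt to the locally hosted model, and the tolerance, retry count and temperature steps are illustrative rather than the exact values in my script.

```python
import random
import re
from typing import Callable

def quotes_are_grounded(summary: str, transcript: str, tolerance: float = 0.1) -> bool:
    """Check that (nearly) every word quoted in the summary also appears in the transcript."""
    transcript_words = set(re.findall(r"[\w']+", transcript.lower()))
    for quote in re.findall(r'"([^"]+)"', summary):
        quote_words = re.findall(r"[\w']+", quote.lower())
        if not quote_words:
            continue
        missing = [w for w in quote_words if w not in transcript_words]
        if len(missing) / len(quote_words) > tolerance:
            return False  # too many quoted words with no support in the source text
    return True

def summarise_with_retries(transcript: str, generate: Callable[..., str], max_attempts: int = 5) -> str | None:
    """Regenerate the summary, with a fresh seed and a warmer temperature, until its quotes check out."""
    temperature = 0.2
    for _ in range(max_attempts):
        seed = random.randint(0, 2**31 - 1)
        summary = generate(transcript, seed=seed, temperature=temperature)
        if quotes_are_grounded(summary, transcript):
            return summary
        temperature += 0.2  # nudge the next attempt towards more diverse output
    return None  # give up and flag this transcript for a human
```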

Putting it together

The Python script I've written is a bit of a mess, and not very complicated - essentially it's just sticking together the steps I outline above. The tricky bit is the thinking and the prompt engineering. So I'm not planning on publishing that any time soon - although I might tidy it up a bit and release something, if there's any demand.

I'll be writing up the overall workflow soon.

Things that didn't work (yet)

Few-shot prompting

This is a technique where you provide a model answer as part of your prompt, to guide the LLM's response. I've seen it be really effective as a way of getting consistently formatted output from an LLM and ensuring it hits all the points you want.

Unfortunately, I've not got this to work reliably yet. Every time I've tried, I've ended up with the same kind of garbage I described above in "Keep it simple". My assumption here is that the model answer is pushing the length of the prompt into "drifting attention" territory. This is a shame, and I'd like to get it to work - we'll be returning to it when we do more investigation into the attention span issue.
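
For anyone who hasn't seen the technique, a few-shot prompt simply prepends one or more worked examples before the real task. A minimal sketch (the example transcript and summary here are invented placeholders) looks something like this:

```python
# Few-shot sketch: the model sees a worked example before the real transcript.
FEW_SHOT_TEMPLATE = """Your task is to summarise a transcript of a research interview.

### Example Transcript ###:
Interviewer: How do you see audio habits changing?
Interviewee: Honestly, I think personalisation is the big shift...

### Example Summary ###:
Main topics: the shift towards personalised listening...

### Interview Transcript ###:
{docs}

### Summary ###:
"""
```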

MapReduce

I read about this strategy while looking into approaches to summarising long documents. It's conceptually similar to the MapReduce model used for processing big data stores: a document that's longer than an LLM's context length is broken into chunks that do fit into the context. A summary is generated for each chunk (the 'map' step), then an overall summary is generated by combining all the chunk summaries together (the 'reduce' step).

Whenever I tried to do the 'reduce' step with an LLM, I ended up getting the kind of garbage back that I've come to associate with it 'losing interest'. My assumption here is that the combined summaries are too long for the LLM to be able to summarise effectively, so for now the 'reduce' is being done by a person.

This isn't so bad from a humans-doing-thematic-analysis point of view — we need to review the summaries anyway, so doing a bit of mental work to understand and combine the chunk summaries is doing an important job of familiarising ourselves with the material and sinking it into our brains. We'll be going back to this, though; it should in theory be doable.
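
The shape of that approach, with the 'reduce' step handed to a person as described above, looks roughly like this (a sketch only; `summarise` stands in for the prompt-and-check pipeline from earlier):

```python
from typing import Callable

def map_step(chunks: list[str], summarise: Callable[[str], str]) -> list[str]:
    """'Map': summarise each transcript chunk independently."""
    return [summarise(chunk) for chunk in chunks]

def prepare_for_human_reduce(chunk_summaries: list[str]) -> str:
    """'Reduce' (done by a human for now): present the chunk summaries in order for review."""
    return "\n\n---\n\n".join(
        f"Chunk {i + 1} summary:\n{summary}" for i, summary in enumerate(chunk_summaries)
    )
```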

Final thoughts

It's cost me a few late nights (and more than one migraine [7]), but I now have a system which can reliably produce decent summaries from interview transcripts. As a bonus, by peering over my shoulder and asking me what I'm doing while I work some of this through, my 8-year-old son now has a more nuanced understanding of LLMs than your average AI hype merchant, so that's nice.

It should be obvious by now, but it bears repeating that this particular application of LLMs is never going to replace a human researcher. It’s bump-starting a specific part of a research process, which overall is heavily dependent on the instincts, knowledge, creative connection-making, and value judgements which remain stubbornly the forte of squishy human brains. I’m not sure LLMs ever could model these traits.

And I’d never trust the output of an LLM without checking it first. Hallucinations are an intrinsic part of the technology and there are always subtle little nuggets or allusions that an LLM will miss in an interview transcript.

However, we’ve found some useful ways to mitigate hallucinations when using LLMs to summarise text and help pick out quotes and key themes in material gathered as part of our research.


1. An interesting thing to me about the current wave of LLM and generative AI is the amount of effort people are putting in to make this stuff accessible to others. This feels like a shift; I've sort of come to expect new technologies, especially open source ones, to come with an expectation that you'll figure it out for yourself. I like this shift!

2. I just did the maths, and apparently they each consume about as much electricity as a high-end gaming graphics card, sometimes a little less. So consider my late-night hacking sessions equivalent to me having spent the time playing Cyberpunk 2077 or something, in terms of energy use.

3. I've written before about this. Suffice to say: not a fan.

4. I think that's a bit unfair, tbh, but it's a helpful generalisation to think with.

5. This is a remarkably similar set of strategies to some that I use with my small, chaotic children.

6. This could definitely be smarter, say by comparing the texts in a more sophisticated way, but this method works well enough for now, so why complicate things?

7. It has been suggested that we adopt the NHM metric (Number of Henry Migraines) as a measure of research tasks, like Sherlock Holmes and his famous three-pipe problems. So far, I have resisted this proposal.

8. This would probably make a few of my colleagues quite cross, but it’s theoretically doable.

9. The exact length of the chunks varies from model to model, and is something I felt my way towards by setting it to 2048 (the original LLaMA's context window size) and increasing it by powers of 2 until the model's output started to lose coherence.

Custom fonts for streaming subtitles
By Matt Juggins, 1 May 2024 (/rd/blog/2024-04-custom-fonts-streaming-subtitles)

Subtitles provide an on-screen text representation of our content and are used not only by viewers with hearing loss but by many other sections of our audience. Ensuring the readability of our subtitles on both broadcast and internet-delivered services, across the different languages used within the UK, is therefore of critical importance, and the choice of font has a significant impact on accessibility and legibility. ±«Óătv Research & Development has recently contributed support for the DVB Font Downloading mechanism to the dash.js project to enable the ±«Óătv to use our preferred ±«Óătv Reith Sans font.

On broadcast platforms, content providers have full control over the appearance of subtitles as they are broadcast in bitmap form. Subtitles for our internet streams work differently, with the client device rendering the subtitle text. Typically, content providers don’t control the platforms to which they deliver media, and so they have limited influence over the default font used to display subtitles. There is, however, a mechanism standardised by DVB that allows the content provider to request that a downloadable font be used with the subtitles in a DASH stream (or even to indicate that the font must be used).

±«Óătv Research & Development's Distribution team have added support for the DVB Font Downloading mechanism into the latest release of dash.js, an open source JavaScript library for MPEG DASH streaming on browser-based devices. In addition, we’ve also provided accompanying documentation.

To confirm this new font downloading functionality works in dash.js and to allow other content providers to test their own MPEG DASH players, we’ve added new test material to our website, covering both live and on-demand streams. We’ve also made a number of these test streams more widely available.

A still frame from the ±«Óătv R&D testcard for testing custom font downloads.

Each stream tests a different facet of the “DVB Font Downloading” mechanism, including amongst others: 

  • Support for different font file formats 

  • Situations where incorrect or malformed URLs are supplied for font files 

  • Situations where subtitles must not be presented if the specified font is not available 

These streams use a (hand drawn!) test font we created in-house called “±«ÓătvRD_Massif”. Whilst it won’t be winning any awards for design, the font has been designed to perform some useful character re-mapping. The test streams contain the words “WRONG font”; this means that when any other font is used, either because of a failed download or the client ignoring download instructions, those words will be visible. However, when “±«ÓătvRD_Massif” is correctly downloaded and used for display, the words “RIGHT font” are rendered on screen. Other useful indicators appear on-screen too, such as a timestamp and the format of the font in use.

We hope that these test streams and the addition of support for DVB Font Downloading in dash.js will enable content providers using MPEG DASH to improve subtitle delivery for their audiences. In particular, we hope this can improve subtitle delivery on TV devices as more and more services move towards IP delivery.

Project Timbre: Investigating mobile coverage for live radio streaming on ±«Óătv Sounds
By Andrew Murphy, 15 March 2024 (/rd/blog/2024-03-project-timbre-investigating-mobile-coverage-for-live-radio-streaming-on-bbc-sounds)

We’re collecting additional telemetry data in a prototype of the ±«Óătv Sounds mobile app to explore how well mobile networks deliver live streaming radio. We call it Project Timbre.

±«Óătv Sounds gives listeners access to the full gamut of the ±«Óătv’s live radio stations and on-demand content – wherever they are. If this is on the move, be it as a pedestrian or in the car, the listener is likely to be reliant on delivery over mobile networks.

Today, around 91% of the UK landmass has 4G coverage from at least one operator, and ongoing coverage programmes are expected to see this rise to 95% on completion.

But what does ‘coverage’ mean in the real world for the listener’s Quality of Experience? I'm working with Simon Elliott from the ±«Óătv's Distribution & Business Development team, and together we aim to find out.


Ofcom carry out work to measure mobile coverage, and make that data available through outlets including their Connected Nations reports and other regular reporting.

Last year’s review, published in April 2022, also made recommendations for work on mobile networks and radio streaming, specifically:

R32: Industry should work closely with Mobile Network Operators to promote the build out of robust mobile data networks (5G) and deliver on-demand, streamed listener services focused on in-car listening.

R33: Following on from the Plum study, radio broadcasters, transmission providers and Ofcom should initiate a programme of field-testing and trials to review and validate the Plum findings on 4G/5G coverage. The results of this testing should be discussed with Ofcom to ensure they include in their Connected Nations reporting a measure appropriate for reliable radio/audio streaming.

In Project Timbre, we are investigating these issues by focussing on the real-world Quality of Experience (QoE) for listeners using ±«Óătv Sounds. We are concentrating on live radio as the most challenging use case - after all, you can’t download something before it has happened live. However, the concepts we’re exploring can also apply to streaming of on-demand content.

Why are we interested?

We have high confidence in the QoE delivered by our extensive network of traditional broadcast transmitters. These are downlink-only and dedicated to live audio delivery, being built for the specific requirements of broadcast.

Mobile networks, in contrast, are multi-purpose, highly complex and built to satisfy a broad range of competing requirements. These bi-directional IP networks allow the full range of ±«Óătv content to be made available, and offer the potential for new experiences and personalisation. Here, it’s harder to have confidence in the QoE since it depends on many factors, not least the intricate interaction between the network performance and the reactions of the algorithms in the playback client to dynamic changes in network characteristics.

However, this bi-directional nature of mobile networks gives us the opportunity to collect feedback on performance. We can then use this data to:

  • try to understand the impact of predictions of mobile coverage on Quality of Experience; and
  • optimise products like ±«Óătv Sounds to make them the best they can be in a mobile environment.

What are we measuring?

We’ve built an augmented, private prototype of the Android version of ±«Óătv Sounds. Our engineers are using this internally to collect data on the real-world Quality of Experience (QoE) over mobile networks as they are today. We can explore variations due to location and time of day and use that data to identify the key metrics that have the greatest impact on QoE. It also enables us to have meaningful conversations with industry about how we can work collaboratively to make the QoE the best it can be and to examine the role that network features such as 5G Media Streaming could play. We’ve started with Android, since its APIs provide access to a richer set of low-level data on the mobile network than we have access to with iOS.

The image below shows what we’re doing:

[Diagram: the live radio distribution chain from the ±«Óătv, via CDNs and mobile networks, to the ±«Óătv Sounds app, showing the four types of data we collect.]

The live radio stream originates at the ±«Óătv, where it is packaged as a sequence of audio segments. These are then delivered using HTTP via internet infrastructure, Content Delivery Networks (CDNs) and the mobile network to the ±«Óătv Sounds app running on a smartphone (or, in the future, a connected car). The sequence of discrete audio segments is then reassembled into a continuous stream of audio that is then played back to the listener.

We currently collate four types of data shown in the figure:

  1. QoE metrics: These are derived from the audio playback client (the ±«Óătv’s Standard Media Player) and provide information about buffering instances (i.e. pauses in the audio playback) and how far behind live the listener is;
  2. Audio Delivery metrics: These are provided by the underlying HTTP library, used to download the sequence of MPEG-DASH audio segments that constitute the live stream, including items such as the time taken to transfer the audio data and the time taken for the initial client request to be serviced (Time To First Byte, or TTFB);
  3. Network Quality metrics: Using the Android Telephony APIs, we can collect detailed information about which mobile network is in use, its signal strength, signal quality and the primary serving cell ID;
  4. CDN View: Unlike the above metrics that are all reported client-side, we can use logs from the CDNs to give us a server-side view of what is happening with the connection and the transfer of each audio segment.

This data is then regularly recorded and logged against location and time of day.
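
As an illustration of what one logged record might contain, here's a sketch in Python. The field names are based on the metrics described above and visible in the dashboards, but they are illustrative rather than the actual schema:

```python
from dataclasses import dataclass

@dataclass
class TimbreTelemetryRecord:
    """One client-side telemetry message (illustrative fields, not the real schema)."""
    timestamp_utc: str
    latitude: float                    # logged against location and time of day
    longitude: float
    # 1. QoE metrics from the playback client
    playback_state: str                # e.g. "PLAYING", "BUFFERING", "BUF_4_SEEK"
    smp_progress_behind_live_s: float  # how far behind the live edge the listener is
    # 2. Audio delivery metrics from the HTTP library
    segment_ttfb_ms: float             # Time To First Byte for the segment request
    segment_transfer_ms: float         # time to transfer the audio data itself
    # 3. Network quality metrics from the Android Telephony APIs
    network_operator: str
    rsrp_dbm: float                    # signal strength
    rssi_dbm: float
    serving_cell_id: int
```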

The client-side data (1, 2 and 3 above) is collected using messages that are delivered to a back-end database. To avoid bias, we ensure we capture data from any areas with no mobile signal, with any messages that can’t be delivered being buffered in the client for delivery to the database once connectivity returns. We use a dashboard for real-time monitoring and debugging of the system.

In the example dashboard below, you can see the reporting of the playback state (e.g. PLAYING or BUFFERING), how far behind live the audio sits (‘smp_progress_behind_live_s’), the TTFB (‘Audio Segment Download Open Timer’) and mobile network metrics such as signal strengths (RSRP and RSSI) and cell handovers (‘cell_id_change_count’).

[Screenshot: example monitoring dashboard showing playback state, delay behind live, segment download timers, and mobile network metrics such as RSRP, RSSI and cell handovers.]

Quality of Experience and buffering

The audio segment duration on ±«Óătv Sounds is 6.4 seconds. To play uninterrupted, live audio and keep up with the live edge of the broadcast, the ±«Óătv Sounds app therefore needs to download successive MPEG-DASH audio segments on average every 6.4 seconds.

Setting aside any audio buffer on the device, failure to download any new segment within 6.4 seconds will firstly result in a pause in the audio. Secondly, the audio will resume where it left off when the pending segment downloads, albeit with the listener delayed further behind the live edge. The Quality of Experience will be impacted by both these effects.
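
A toy calculation makes the two effects explicit (the function and numbers below are illustrative only):

```python
SEGMENT_DURATION_S = 6.4  # ±«Óătv Sounds audio segment length

def playback_effect(segment_delay_s: float, buffer_level_s: float = 0.0) -> dict:
    """Toy model of the race between segment arrival and playout."""
    headroom_s = SEGMENT_DURATION_S + buffer_level_s
    if segment_delay_s <= headroom_s:
        return {"stall_s": 0.0, "extra_delay_behind_live_s": 0.0}
    stall_s = segment_delay_s - headroom_s
    # playback pauses for stall_s, then resumes where it left off,
    # leaving the listener that much further behind the live edge
    return {"stall_s": stall_s, "extra_delay_behind_live_s": stall_s}

# e.g. a segment taking 9 seconds to arrive with an empty buffer:
# playback_effect(9.0) -> {'stall_s': 2.6, 'extra_delay_behind_live_s': 2.6}
```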

Furthermore, the impact of this audio delay for the listener differs depending on the content being listened to, with live sport being potentially most critical – nobody wants to hear from their mates by text that an all-important goal has been scored before they hear it for themselves!

Under ‘normal’ conditions, where the entire distribution chain (including, for example, CDNs and mobile networks) performs sufficiently well, successive audio segments will typically arrive within a very short time of being requested (i.e. in the order of tens of milliseconds). Under such conditions, uninterrupted playback is guaranteed.

However, we know that playback interruptions (‘BUFFERING’ events) do happen in practice. Barring any erroneous operation of the ±«Óătv Sounds app itself – or being in an area without any mobile coverage – playback interruptions will be the result of excessive (i.e., more than 6.4 seconds) delay in receiving the requested audio segment data.

The presence of the audio buffer in the app decreases the chance of a delayed segment resulting in a playback interruption, and allows for a seamless switch between different modes of reception such as 4G to/from WiFi. There is a trade-off between the fill level of this buffer and the tolerance to excessive delay; a fuller buffer means that the client can wait longer for a segment to be delivered before an audio interruption occurs. However this resilience comes at the cost of a longer delay behind live for the listener compared with traditional broadcast reception.

An added complication is that this audio buffer will vary in how full it is, depending on how the user has controlled playback (e.g. pausing or fast forwarding). Crucially, the audio buffer level will also depend on any previous audio interruptions as a result of excessive delay that the client has been exposed to, such as having travelled through areas of poorer mobile coverage. The net effect is that different users experience different QoE depending on their past behaviour; indeed a single user may even experience a different QoE in the same location on the return vs. the outward leg of their journey.

In summary, the listener’s QoE is a complicated product of a number of factors, including user behaviour, exposure to previous network outages or delays and the behaviour of the playback client in reaction to those.

As a result, we're interested in occasions when the reception of the next segment is excessively delayed since this has the potential to cause an audio interruption. It can also act as a common currency, not impacted by variation in the fill level of the client audio buffer.

The impact of excessive delay can be seen in the particular example below. The graphs depict data from four different handsets, each connected to a different mobile network. In this case the handsets were in a static location, bringing into sharp relief the variation in QoE that can be observed by different users even in a single place.

[Graphs: playback state, delay behind live, TTFB and signal strength over 30 minutes for four handsets, each on a different mobile network.]

The additional ‘BUF_4_SEEK’ state is used to distinguish between buffering that has occurred at the behest of the listener (i.e., when changing station, fast-forwarding, etc.) and buffering that results from the performance of the underlying mobile network.

As can be seen, over the course of the particular 30 minutes depicted here, several ‘BUFFERING’ instances occur, resulting in pauses in playback, due to the relevant audio segments not being delivered across the mobile network in time. The net effect of these is a spread of audio delays from around 7 to 24 seconds, across the four handsets.

It is interesting to note that, in this example, the measured signal strength (RSRP) appears generally acceptable throughout, although it varies significantly over time. However, the Time to First Byte (TTFB) can be significant and, in some instances, peaks at a time period at or above the length of the audio segment itself. In these instances, the actual audio data (which is only of the order of a few tens of kilobytes) is often transferred almost instantaneously, potentially giving an unrealistically high estimate of transfer speed if only the time taken to receive the payload is measured.

Digging deeper

The project is currently a small-scale trial, with around twenty handsets in the hands of engineers and volunteer members of staff; there are no plans to deploy this publicly. Despite this limited scale, we already have data for around 6,500 km of road and tens of millions of recorded data points. And while each individual measurement is a snapshot of the performance in a particular location at a particular time, we can use statistical techniques - of the sorts we’ve been using for many years to analyse broadcast networks - to build up a better picture of when and where the mobile networks work.

Statistical analysis and QoE metrics

To get more statistically significant results, we first need to recognise that streaming radio/audio is different to broadcast, in particular:

  • Audio is segmented into chunks (of 6.4 seconds duration);
  • The connection speed is variable rather than a constant bit rate;
  • The mobile coverage for streaming of live radio does not appear to be solely determined by signal strength nor by raw transfer speed (since the TTFB can also be significant).

To develop Quality of Experience metrics, we aggregate our data points into 100m × 100m map squares (pixels) and are currently examining:

  • The proportion of time the audio was interrupted due to buffering - with the proviso that this will be affected by the listener’s past behaviour (see above);
  • The Time To First Byte (TTFB) - since this can often dominate over the transfer time of the actual audio data; and
  • The probability of the availability of a given capacity (bit rate) within a map pixel – this would allow us to make a projection for the expected service coverage for a given required bit rate (e.g. audio or video).
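
To make the pixel aggregation concrete, here is a minimal sketch of the grid step (the coordinate handling and the choice of statistic are illustrative; the real analysis is more involved):

```python
from collections import defaultdict

PIXEL_SIZE_M = 100  # 100m x 100m map squares

def pixel_key(easting_m: float, northing_m: float) -> tuple[int, int]:
    """Snap a measurement's projected position (in metres) to its 100m map pixel."""
    return (int(easting_m // PIXEL_SIZE_M), int(northing_m // PIXEL_SIZE_M))

def median_ttfb_by_pixel(samples: list[tuple[float, float, float]]) -> dict:
    """Aggregate (easting_m, northing_m, ttfb_ms) samples into a median TTFB per map pixel."""
    by_pixel = defaultdict(list)
    for easting, northing, ttfb_ms in samples:
        by_pixel[pixel_key(easting, northing)].append(ttfb_ms)
    return {pixel: sorted(values)[len(values) // 2] for pixel, values in by_pixel.items()}
```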

This approach also allows us to start comparing our data with other publicly available data sources and statistics. If we can better understand and incorporate publicly available datasets into our work, we can increase our knowledge of how network performance affects ±«Óătv services. Additionally, we could encourage and explain the benefit of collecting and/or making available additional data that also encompasses the requirements of streaming services, for the benefit of the wider public and industry.

A couple of examples of our mapping into pixels are shown below. The shaded areas with a blue outline represent geographical areas which may be of interest and for which various statistics may be produced - for example, the percentage of roads or railways that may experience various levels of network delay (i.e. TTFB, ‘Downlink Open’).

[Maps: examples of data aggregated into 100m × 100m pixels, with shaded, blue-outlined areas of interest.]

Conclusions

The data we’re collecting and analysing is giving us insight into real-world Quality of Experience and this certainly appears interesting. However, networks are complex and evolving - adding coverage, improving capacity and innovating with new architectures, capabilities and generations (4G, 5G, 6G, 7G …). Our data is therefore not the whole story.

Although we’ve only presented some snapshots of our data, they illustrate several points that we’re seeing more widely, in particular:

  • The impact of the fill level of the client buffer means that the use of buffering instances - i.e. audio interruptions - as a QoE metric has to be done with care;
  • A better understanding of the characteristics of mobile networks - in comparison to fixed networks - brings opportunities to optimise streaming audio apps for use on the go;
  • Sufficient mobile signal strength (RSRP) is not a guarantee of coverage for a specific service (in this case, live audio streaming);
  • Time to First Byte appears interesting in this context;
  • Transfer speeds need to be measured with care, particularly when dealing with audio segments that may be relatively small in size, and where the TTFB can dominate over the data transfer time; and
  • Not every listener to live streaming radio will experience an event at precisely the same time, and they will typically be somewhat behind listeners to our traditional broadcast radio networks - something our work on Low Latency Streaming is addressing. This could have an editorial impact that needs to be considered.

There’s much more work to do in Project Timbre. We know, for example, that bottlenecks can and do occur at different points in the distribution chain; focussing solely on a client-side viewpoint doesn’t necessarily allow us to distinguish between these. We’re therefore currently looking in detail at the server-side CDN access logs mentioned above to see if we can start to unpick these issues. The intention is that we can then focus our research efforts on the relevant points of this chain.

In line with the recommendations set out above, we’re keen to work with industry and other broadcasters to better understand all these issues, work across the distribution chain to optimise QoE and efficiency of delivery, and explore how concepts of coverage and buffering might be better relayed to the public.

±«Óătv R&D - 5G Media Streaming

±«Óătv R&D - Low Latency Streaming

Mark the good stuff: Content provenance and the fight against disinformation
By Charlie Halford, 7 March 2024 (/rd/blog/2024-03-c2pa-verification-news-journalism-credentials)

±«Óătv News’s Verify team is a dedicated group of 60 journalists who fact-check, verify video, counter disinformation, analyse data and – crucially – explain complex stories in the pursuit of truth. On Monday, March 4th, Verify published their first article using a new open media provenance technology called C2PA. The C2PA standard records digitally signed information about the provenance of imagery, video and audio – information (or signals) that shows where a piece of media has come from and how it’s been edited. Like an audit trail or a history, these signals are called ‘content credentials’.

Content credentials can be used to help audiences distinguish between authentic, trustworthy media and content that has been faked, by showing them where it has come from. Ultimately, audiences are the ones who decide whether they trust that media. The digital signature attached to the provenance information ensures that when the media is “validated”, the person or computer reading the image can be sure that it came from the ±«Óătv (or any other source with its own credentials).

This is important for two reasons. First, it gives publishers like the ±«Óătv the ability to share transparently with our audiences what we do every day to deliver great journalism. It also allows us to mark content that is shared across third party platforms (like Facebook) so audiences can trust that when they see a piece of ±«Óătv content it does in fact come from the ±«Óătv.

For the past three years, ±«Óătv R&D has been an active partner in the development of C2PA. It has been developed in collaboration with major media and technology partners, including Microsoft, the New York Times and Adobe. Membership in C2PA is growing to include organisations from all over the world, from established hardware manufacturers like Canon to technology leaders, fellow media organisations like NHK, and even the Publicis Group covering the advertising industry. Google has now joined the C2PA steering committee, and social media companies are leaning in too: they are actively assessing implementing C2PA across their platforms.

As more organisations sign up to C2PA, people will become used to seeing images or video with content credentials included. We hope this will give more and more people the tools they need to judge the authenticity of what they’re seeing for themselves and preserve the integrity of the information ecosystem.

±«Óătv News - Transparency tool launched by ±«Óătv Verify

How we got here

At the ±«Óătv, we pride ourselves on being the world’s most trusted news media organisation. We believe that ‘If you know how it’s made, you can trust what it says - trust is earned.’ We are committed to delivering impartial, accurate news to our audiences. This matters now more than ever as we face a growing onslaught of fake or misleading information.

±«Óătv R&D has been thinking about this problem for a long time and, back in 2019, we decided to do something about it. In collaboration with Microsoft, the New York Times and CBC/Radio-Canada, ±«Óătv R&D began exploring some early technology solutions to the problem of disinformation. We soon began to collaborate with another like-minded organisation called the Content Authenticity Initiative, headed by Adobe.

Together, we founded the C2PA, an organisation set up to develop open media provenance standards. It represents a collection of members from across many different industries, all united by the desire to represent the provenance of media accurately and transparently. To ensure that C2PA would work for news media organisations, Project Origin held a series of workshops to assemble the requirements for this technology. We learned that news organisations needed to be able to add their own data to media so they could provide the same kind of context to their readers that a news report would. They also wanted to add their own provenance data so that users could tell where the content had come from.

We have come a long way in the past five years. Today, members of C2PA all over the world are planning to make use of the specification’s media provenance features in lots of interesting ways. This could include labelling AI generated images, presenting the original context of a video when it is shared on social media, or providing a complete list of all the edits done to an image since it was taken.

We hope that this is just the beginning. C2PA was designed to represent the complete chain of provenance of all media, which, when available, will provide users with an unparalleled amount of information to help them decide whether the content they are consuming is genuine and trustworthy.

How does it all work?

Although it may seem straightforward, publishers must orchestrate a complex process before an image or video is ready to be shared online. What does this mean in practice? Let’s take an example from ±«Óătv News. An image is captured in the field by a photographer. The photographer may upload the picture from their camera to their computer to do some light edits. These could include improving the lighting of their subject or correcting white balance. Then, they may send the edited image to an agency. There, the agency will catalogue the image and its metadata (information about where and when it was captured, an image description, etc). The agency may also lightly edit the image – perhaps cropping to better emphasise the subject, or resizing or converting the format to whatever their customers need. Then a news organisation, like the ±«Óătv, might use that photo. Again, depending on where it’s being used, the image might be edited again (perhaps it will be used in a homepage slot that requires a square picture).

Finally, the image is made available to the public as part of a news article.

Because the provenance information is digitally signed over the content, any edits to the content will break it. This is on purpose – and it is what allows C2PA-signed images to be so secure. It is too difficult to allow some edits to work and others to fail, so we require that each edit is individually signed. This, when available, provides a chain of edits that can be viewed all the way back to the point at which the camera took the photo. Each edit, signed by the entity that made it, can then let the user know what edit was performed. The spec also allows thumbnails to be recorded at each step.
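
This isn't the C2PA data model or its real signing scheme (which uses certificate-based signatures rather than a shared key), but a toy sketch in Python shows why any edit to signed bytes invalidates the signature, and therefore has to be signed again by whoever made it:

```python
import hashlib
import hmac

SIGNING_KEY = b"toy-publisher-key"  # stand-in only: C2PA uses certificates, not a shared secret

def sign(media_bytes: bytes) -> bytes:
    """Toy 'signature': a keyed digest over the exact bytes of the media."""
    return hmac.new(SIGNING_KEY, hashlib.sha256(media_bytes).digest(), hashlib.sha256).digest()

def verify(media_bytes: bytes, signature: bytes) -> bool:
    return hmac.compare_digest(sign(media_bytes), signature)

original = b"...image bytes as captured..."
signature = sign(original)

assert verify(original, signature)                  # untouched media validates
edited = original.replace(b"captured", b"cropped")
assert not verify(edited, signature)                # any edit breaks the old signature
```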

All of this underscores the primary intention of C2PA: to provide secure, authenticatable provenance that allows users to make their own decisions on whether content is trustworthy. There is still some way to go before this is possible, particularly when it comes to video and audio.

We should note that while this is an important tool for giving audiences the information they need to distinguish between trustworthy and untrustworthy information, it cannot be used as a fact-checking mechanism to establish whether an image itself is real or fake. It simply allows users to see transparently what has been done to an image before it has been published and, armed with that information, to make decisions about the authenticity of the image themselves.

We are witnessing a turning point for the news media industry. We hope this marks the beginning of a renewed fight back against online mis- and disinformation, giving news media organisations another important tool for reigniting trust in the information ecosystem.

±«Óătv R&D - Is this real? Provenance in generative media

±«Óătv R&D - Increasing trust in content: Media provenance and Project Origin

±«Óătv Verify

±«Óătv News - Transparency tool launched by ±«Óătv Verify

A second Emmy for our work on High Dynamic Range television
By Andrew Cotton, 29 February 2024 (/rd/blog/2024-02-high-dynamic-range-television-emmy)

We were immensely pleased to hear that we have won another Emmy® for our work on high dynamic range television (HDR-TV). The first Television Academy Emmy® was awarded last year to the ITU-R for its HDR-TV Recommendation – the international standard that specifies both the ±«Óătv/NHK Hybrid Log-Gamma (HLG) HDR-TV system and Dolby’s PQ. This second Emmy® has been awarded to the ±«Óătv for our work on developing what has become known as the “single-stream” HDR live production workflow, and the format conversion LUTs (look-up tables) that we developed to support the workflow. I would also like to note that US broadcaster NBC have received an award too, for their developments on single-stream workflows.

Statue © ATAS/NATAS

Early trials of live HDR TV production used a parallel production workflow, whereby both the HDR and standard dynamic range (SDR) outputs were taken from cameras and passed through largely separate HDR and SDR production infrastructure, to protect the SDR signal chain and the SDR audience experience. In many cases the HDR signal path supported UHD resolution (3840 x 2160 pixels) whilst the SDR signal path only needed to support HD resolution (1920 x 1080 pixels). To avoid the need to have two Vision Mixer (switcher) operators, the HDR mixer would be configured to follow the SDR mixer. It quickly became clear that the parallel workflow was both costly in terms of equipment, and difficult to configure – particularly ensuring that the audio and video were correctly timed through the various HDR and SDR signal paths. Just as it had been when we migrated from 4:3 to widescreen production, and standard definition (SD) to high definition (HD) production, if HDR production were to become mainstream, the ±«Óătv recognised that it would be essential to run a single UHD HDR format through the production infrastructure. The SDR programme output would need to be automatically derived from the HDR programme output, without compromising the signal delivered to both HDR and SDR audiences.

Our 2018 UHD HDR coverage of the wedding of the Duke and Duchess of Sussex demonstrated that it was possible to derive a high quality SDR output from a UHD HDR production chain. At that time, the ±«Óătv’s domestic HD SDR output was derived from a parallel signal chain, but both the UHD SDR feed for Sky and the international HD SDR programme feeds, were derived from the UHD HDR output through one of our early HDR to SDR conversion LUTs. That gave us the confidence we needed to remove the parallel HD SDR signal path from our UHD HDR coverage of the FA Cup the following year. From the quarter finals onwards, the ±«Óătv One HD (SDR) feed was entirely derived from the UHD HDR programme output.

As my 2019 FA Cup article explained, even with the huge amount of planning and testing that we carried out ahead of the quarter final at Millwall, we encountered operational issues that we hadn’t anticipated. Of particular concern was the importance to the Vision Supervisors that the UHD HDR and HD SDR programme feeds have an identical “look”, and that the SDR programme output derived via the output HDR to SDR “down-mapping” LUT not only looked the same as that seen by the camera “racks” (shading) engineers, but also measured the same on a waveform monitor. Until that point, most R&D engineers working on the development of HDR TV had been comfortable with the HDR and SDR signals having a different appearance - after all, HDR TV had been designed to more accurately reproduce the scene in front of the camera. So, through our coverage of the semi-final and final matches we refined the workflow and format conversion LUTs to precisely meet their requirements. We developed what has now become known as the industry’s “single-stream” live HDR production workflow.

The workflow has proven to be extremely robust, reliable and repeatable. Not only does it deliver spectacular HDR images, but the SDR audience benefits too. Our format conversion LUTs have matured to such a degree that one of the UK’s most respected Vision Supervisors said to me recently, when the HDR output has been “painted” to give an artistically pleasing image, the SDR output derived from the HDR via our LUTs exceeds the quality of the native SDR camera outputs.

Throughout the development of HLG we have shared our experiences with the industry through our project website and blog posts. We have also documented the live production workflow as it developed in an ITU-R Report and a number of other industry publications. Not only has the workflow been used by the ±«Óătv for FA Cup football, our coverage of the 2020 Euros and the Coronation of King Charles III and Queen Camilla, but it has also been adopted by Sky TV for their UHD HDR sports coverage, by other broadcasters for the Olympics, the FIFA World Cup, the Euros and the Wimbledon Tennis Championships, and by US broadcasters NBC, CBS and Fox.

We’re thrilled to have received the Emmy® and have our work recognised internationally. The Technology & Engineering Emmy® awards are given for “engineering technologies that either represent so extensive an improvement on existing methods or are so innovative in nature, that they materially have affected television”.

We couldn’t be more proud of our work and grateful to our partners, including ±«Óătv Sport and the ±«Óătv iPlayer teams, who enthusiastically supported the technology trials that were necessary to develop the workflow. I would also like to thank our outside broadcast providers and the vision supervisors, whose patience and feedback were critical in developing the workflow. A workflow that is now proven to meet operational and artistic requirements, as well as delighting our television audiences.

Extending our Mastodon social media trial
By Tristan Ferne and Bill Thompson, 13 February 2024 (/rd/blog/2024-02-extending-our-mastodon-social-media-trial)

In July 2023 we announced our plan to establish a ±«Óătv presence in the Fediverse, a distributed collection of social media applications all linked together by common protocols, as part of our research into social technologies. The experimental ±«Óătv Mastodon server has now been running for six months and hosts accounts for R&D, Radio 4, and 5 Live.

We were aiming to learn about how much work and cost this involved, how many people we’d reach, what levels of engagement we would get and to explore the risks and benefits of the federated model. The trial so far has been really effective in helping us learn about how the Fediverse is evolving, what technical support a Mastodon server needs, what the costs are, and how a large media organisation like the ±«Óătv can engage with the many different overlapping communities that exist in this rapidly changing space.

We are therefore pleased to say that we are going to continue the trial for at least another six months while we share our findings internally and seek more engagement from other ±«Óătv teams. We are also planning to start some technical work investigating ways to publish ±«Óătv content more widely using ActivityPub, the underlying protocol of Mastodon and the Fediverse.
 
So far in the trial we’ve amassed around 60,000 followers across our six trial accounts, and we have had to do very little moderation of replies associated with our content. Reassuringly, most of the comments and feedback have been positive, welcoming both our interest and the way we have set things up. We’ve had really encouraging levels of engagement (i.e. replies, re-posts and likes) on Mastodon. For some equivalent posts we’ve seen significantly larger engagement numbers for Mastodon compared to X/Twitter, particularly given the relative sizes of the different platforms. We think this is partly due to the culture of Mastodon, and partly because of some of the topics we’ve posted about. Because this is an experiment and a trial, it's not always the main priority for all the teams involved, so we may not be able to engage and reply as much as the Mastodon community and culture expect, and we recognise this could be an issue going forward.
 
We have been asked by a number of people on Mastodon to include ±«Óătv News social accounts in the trial. Because of the potential sensitivity around news stories, we need to be particularly careful with our editorial processes and within the scope of this trial we are not in a position to guarantee time and effort from other teams outside of R&D. This means we haven't been able to include ±«Óătv News in the trial yet, but we are working on it.
And to be clear, the account represents the R&D team working on news technologies and doesn't represent ±«Óătv News as a whole.
 
Over the next six months the trial will continue and we will continue to learn. We will make best efforts to post and engage with our audiences and add more ±«Óătv accounts to the trial. Internally, we’ll continue to talk to teams about future plans and strategies for a ±«Óătv presence on Mastodon and other emerging social technologies, and we’ll stay in touch with the community by publishing updates like this, hoping that they are useful and relevant to others using or considering a presence on Mastodon or in the wider Fediverse.
 
The last six months have been invaluable in improving our understanding of what it takes for an organisation like the ±«Óătv to be a good member of the Fediverse, and opened up new areas for investigation, such as looking at ways of using ActivityPub as an alternative publishing mechanism for ±«Óătv content, alongside the social uses of Mastodon. 
 
We have been encouraged by the support we’ve had from the wider community and found a lot of interest internally. Our hope is that this work will lead to long-term ±«Óătv engagement with the wider network of federated online services, and that we will be able to serve the communities here just as we provide public value in other online spaces.
NeRFs - A guide to creating impressive 3D fly-through videos from 2D smartphone images
By Alia Sheikh, 18 January 2024 (/rd/blog/2023-12-nerfs-neural-radiance-fields-video)

The ±«Óătv Research & Development visual computing team have been collaborating with ±«Óătv 6 Music to use cutting edge technology to support their recent annual T-shirt Day event, where listeners wearing their favourite band T-shirts can request songs.

Our brief was to create short, eye-catching videos for social media to communicate the concept of a T-shirt being used to request a song. To show how music feels, we really wanted to create something dynamic and maybe even a little surreal. We decided to create smooth moving shots to suggest the journey a song takes to reach our listeners.

It is unusual to create complicated moving shots for radio as the studios tend to be relatively small spaces, so we decided this might be a perfect use-case for NeRFs (Neural Radiance Fields).

NeRFs are neural networks that can represent real life scenes and let us move very freely around those scenes long after they are captured. They are trained using a set of 2D images from the environment (which can be created from simple handheld phone video) and, once processed, allow us to make whatever new visual paths (or moving shots) we want through that space. NeRFs are not new - we’ve been investigating them since 2020 - but recent advances in novel view synthesis algorithms mean that both the quality of the output and the speed we can create NeRFs have drastically improved. We expect the technology to continue to advance rapidly - for example, right now NeRFs cannot make video of moving subjects - but we were curious to see if we could create usable content now.

A few test shoots and many hours spent processing, rendering and re-rendering our assets later, and we had the videos below ready to go out on social media.

They tread an interesting line between showing frozen ‘moments’ but also a journey through these spaces, and we learned something new about the process of capturing or processing NeRFs while making each video.

Our guide to making NeRFs


Capturing a video for our NeRF with a selfie stick.

  • Select a subject. Any people in shot mustn’t move (or ideally even blink) for 90 seconds.
  • Record video of the subject(s) and the space from all angles, trying not to cast a shadow or create any reflections (a selfie stick may be helpful). It does not matter if the video is horizontal or vertical or even if it’s well framed, but do set your camera up to avoid motion blur and don’t move too fast.
  • Use the video to train a neural network that can recreate the scene. There are several different models available which can do this, we chose , which combines elements from several recent models to produce high-quality renders. (You will find out at the end of this step if your capture was good enough.)
  • Decide what camera framing and motion you want to export through the scene.
  • Review the resulting video.
  • If the video turned out as you expected, congratulations! You can now put it into your edit.
  • If the video didn’t turn out the way you wanted, either re-export a different path (relatively fast and easy), retrain your neural network (slow and annoying) or maybe even go back and re-capture your video (sometimes the fastest way, but definitely annoying).
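
If you want to try this loop yourself, the sketch below shows roughly how the steps above could be scripted around the open-source nerfstudio toolkit mentioned later in this post. It is a minimal sketch only: the command names and flags shown (ns-process-data, ns-train, ns-render and their arguments) are assumptions that vary between nerfstudio versions, and the nerfacto model named here is simply one of nerfstudio's defaults, not necessarily the model we chose. Check everything against the documentation for the version you install.

    # Rough sketch of the capture -> train -> render loop using the nerfstudio
    # command-line tools. Command names and flags are assumptions and may differ
    # between nerfstudio versions - check each tool's --help before relying on them.
    import subprocess

    def run(cmd):
        print(">", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # 1. Extract and pose still frames from the handheld (selfie stick) capture video.
    run(["ns-process-data", "video", "--data", "capture.mp4", "--output-dir", "processed"])

    # 2. Train a radiance field on the posed images - the slow, overnight step.
    run(["ns-train", "nerfacto", "--data", "processed"])

    # 3. Render a new camera path through the trained scene. Re-exporting a different
    #    path is cheap compared with retraining or recapturing.
    run(["ns-render", "camera-path",
         "--load-config", "outputs/processed/nerfacto/config.yml",  # written by ns-train
         "--camera-path-filename", "camera_path.json",              # designed in the viewer
         "--output-path", "renders/flythrough.mp4"])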

Someone at a desk editing the video from the shoot on a computer; a notepad on the desk represents the space seen in the video, and Post-it notes on the notepad represent the positions of band members in the scene.

Using post-it notes to help storyboard and plan out 3D camera fly-through paths.

Each video took a couple of minutes to shoot, up to an hour to process, approximately 9 hours to train the network, half an hour to select our camera path, and about 4 hours to export a usable final video. As someone pointed out, this process can be a lot like developing a roll of film and having to wait to see how it comes out.

We decided to export our videos as quite slow journeys through the space - this made it much easier to vary the speed of our camera path in a video editor later.


Screen recording where images from the capture video are aligned in the 3D space - and the subsequent NeRF, once training is complete.


The first NeRF we captured was of Mary Anne Hobbs, who was extremely patient in helping us break new ground. We placed a panel light in the studio to try to ensure that Mary Anne was captured more clearly than the background. We weren’t sure what camera path we wanted, so we comprehensively captured the whole studio, including behind Mary Anne, and facing all sides of the room, from high and low angles. As a result, when stills were extracted from the 90-second video, much of the input was not of Mary Anne, but of the studio and all the details within it.

This meant the neural network had a lot of visual information to store, not only of Mary Anne but also all the details of the room, and rendered them all to approximately the same quality. This gave us a lot of flexibility on the choice of final camera path, but our actual subject, Mary Anne Hobbs, was of lower resolution than would usually pass an editorial standard.

We concluded from this first experiment that it is important to have an idea of the kind of camera path you wish to take prior to capturing data for the NeRF. This includes what you want to focus on as well as the range of viewing angles you are looking to see, so as to focus the network’s attention on regions of interest.


The was the single largest space we have worked in to date, so the capturing of Dutch band also presented some interesting challenges.

To make sure we had options when editing, we tried two methods: one where we stood very close to each member of the band and captured a very short video of each of them at close range, and another using a large monopole to get a very wide range of camera angles and capture the entire band at once.

For the short-range video, the goal was to avoid any movement of the subject that would blur the result and to focus on only one band member at a time to capture fine detail. This gave us very high quality video, but we didn't get to see much of the studio, and using these shots would have introduced more cuts than we really wanted.

Capturing the entire studio and band at once allowed us to plan expansive unbroken camera moves, but as expected, at the expense of details on any one part of the scene. It did mean that we could experiment with some very interesting camera paths such as this overhead-to-face-on shot, which starts in an area for which the NeRF doesn’t have a lot of information, which leads to very surreal images at the start:

[Video: overhead-to-face-on camera move through the NeRF of the band]

Ultimately, the video we chose was a simple push in from a high, far away vantage point to a close-up of the lead singer, something which would normally require a crane or a drone:

[Video: a push in from a high, far away vantage point to a close-up of the lead singer]

We wanted to experiment with making a NeRF ‘come to life’ and did this by moving an additional phone camera into place towards the end of the capture, then asking the band to ‘unfreeze’ and play the chorus of the song. It was important to make sure that the final frame in the NeRF dataset matched as closely as possible to the first frame of the video. We still had issues in the edit, going from a relatively low res NeRF to a high res video, so we added a distortion effect to soften the transition between the two:

[Video: the NeRF ‘coming to life’, with a distortion effect softening the transition into live video]


With these takeaways from both shoots we repeated the process with another 6 Music host, Huw Stephens. We set up panel lighting and the selfie stick as before, though this time we went into this shoot knowing we wanted to see Huw’s face, his hand on the fader, his T-shirt in sharp detail, and to fly from above into the microphone. Consequently, we focused the capture on angles facing him directly, close-ups of his face, and angles over his head. We avoided capturing angles facing away from him or behind him, relying instead on the background being rendered from moments of disocclusion from shots of Huw himself.

The resulting NeRF rendered our host much more crisply and with much greater resolution on his face and T-shirt. We could still exhibit the nonphysical camera paths NeRFs are capable of, but focused within the tighter space around Huw that we had captured. We effectively prioritised quality over quantity, as more of the space in the neural network was dedicated to details pertaining to aspects of the scene we actually wanted to see in the final render, namely, Huw.

We took these lessons into the final shoots of the listener at home and at the tram stop, both captured in a small space with most of the capture focusing on the subject. Our later NeRFs are of markedly better quality.


Why these collaborations are useful

It is extremely valuable to us as a research department to do these sorts of production tests. They tell us how close current technology is to being applied in the industry, but also allow us to accelerate our understanding, innovate on existing technology and prioritise work on the developments we think will be most useful.

Screenshot of the scene of the  listener in their bedroom while being worked on in nerfstudio.

Using nerfstudio to make NeRF content.

For example, we currently plan camera moves with , which allows us easy access to many of the current state-of-the-art methods for producing NeRFs. We then edit our exported video inside a Non-Linear Editor (NLE). However, it would be interesting to consider where in the production workflow the path through a NeRF might best be decided and controlled. There are integrations which allow some control of the NeRF in either animation software or virtual production technologies such as or . What would the editing process be like if the camera angle could be controlled inside an NLE such as or instead? That is the point at which you can see the creative choice in context, so that would be incredibly powerful as it would allow us to re-make a shot to work better with the rest of the edit, and to see the effect of that swiftly.

Working with production colleagues helps us understand which creative possibilities spark the most editorial interest and what the future might hold. Right now, the processing cost of making a moving NeRF is prohibitive, but this may change quickly, and if that happens, we can see many practical and creative uses for this technology.

]]>
UHD HDR production architecture for the Coronation of King Charles III and Queen Camilla /rd/blog/2023-12-uhd-hdr-production-architecture-coronation 2024-01-18T11:42:47Z 2024-01-18T11:42:47Z Andrew Cotton The Coronation of King Charles III and Queen Camilla in May of this year was by far the biggest UHD High Dynamic Range (HDR) production the ±«Óătv has undertaken. It was managed by ±«Óătv Studios Events, who are well accustomed to covering large events of State. In recent years, they have produced the ±«Óătv’s coverage of the late Queen’s funeral, her Platinum Jubilee and two Royal Weddings. Our role was to ensure that ±«Óătv audiences received the very best quality pictures on the day.

For the Coronation, over 100 UHD HDR cameras were deployed, covering the procession between Buckingham Palace and Westminster Abbey, Wellington Barracks and inside Westminster Abbey itself. That is very similar in size to the TV coverage of the US Super Bowl but, unlike the Super Bowl which was produced in 1080p HDR, The Coronation was produced in full UHD (2160p) HDR. Producing in full UHD HDR adds considerable complexity to such a large-scale production, but it was considered important to produce and archive such a historic event in the highest possible quality.

My colleague Simon Thompson described how our coverage utilised seven outside broadcast (OB) trucks from four different providers – one covering the abbey, five covering the procession route and Wellington Barracks, and a final ‘presentation’ truck covering the ±«Óătv studio at Canada Gate and the main ±«Óătv and international programme feeds.

±«Óătv Research & Development were approached by ±«Óătv Studios in October 2022, to advise on the production workflow and help with equipment approval and configuration. With different equipment being used in each of the seven outside broadcasts trucks, we needed to ensure consistent results from each production unit.

The entire production used what’s become known as the ‘single-stream’ UHD HDR production workflow, which we developed through our UHD 2019 FA Cup trials. It is, of course, based on the ±«Óătv/NHK HLG (Hybrid Log-Gamma) HDR format, standardised in the Emmy Award winning Recommendation ITU-R . It has now been adopted worldwide for live HDR TV production, and a simplified version is illustrated below.

[Figure: simplified diagram of the single-stream UHD HDR production workflow]

HDR cameras are controlled by vision engineers looking at a standard dynamic range (SDR) shading (or racking) monitor, fed with the HDR camera’s output via an HDR to SDR ‘down-mapping’ converter. Only the UHD HDR camera signal is passed through the production switcher (vision mixer). Any SDR outputs available from the cameras are ignored and not made available for the production, although they might be used to drive the camera’s viewfinder. Exactly the same type of HDR to SDR converter takes the switcher’s HDR ‘programme output’ (PGM Output) to create the HD SDR version of the programme. So, the HD SDR signal that is viewed by the largest audience is identical to that seen by the vision engineers who are crafting the look of the pictures.

The OB trucks may only have a single UHD HDR ‘check’ monitor to allow accurate adjustment of the UHD cameras’ ‘detail’ (sharpness) settings and ensure that the UHD signals are in focus and not too noisy – adjustments that would be very difficult on an HD SDR monitor. Some trucks also provide UHD HDR ‘Programme’ and ‘Preview’ monitors for the director, but that’s not essential. Most often, the entire production gallery monitor ‘stack’ is running in SDR.

In practice, most UHD HDR cameras can provide both UHD (2160p) HLG and HD (1080p) HLG outputs. To reduce the need for costly UHD HDR to SDR down-mappers, at the expense of some additional cabling, the 1080p HLG camera output is down-mapped for the camera shading. As the shading monitors are usually 17" displays, 1080p resolution is sufficient.

The workflow is documented in greater detail in Section 7.2.2 of Report ITU-R and the EBU’s Tech Report . The workflow is also used by Sky for all of their English Premier League coverage and a number of other sports too; and outside of the UK it is used by for the Olympics, for the FIFA World Cup, UEFA for the Euros and US broadcasters CBS, NBC and Fox. It is extremely robust, reliable and repeatable and is also straightforward for operational staff, as critical camera control is performed in exactly the same way as for a conventional HD SDR production.

Workflow improvements

Features added to HDR production equipment since our 2019 FA Cup trials have significantly simplified the workflow.

Back in 2019, we needed to use a mixture of ‘scene-light’ and ‘display-light’ HDR/SDR format conversions, adding complexity and increasing the risk of operational errors:

  • ‘Scene-light’ conversions are used for matching HDR and SDR cameras, as they operate by calculating the light falling on a camera sensor. As HDR and SDR cameras have a different ‘look’ (colour saturation and tone), scene-light conversions change the look of signals too.
  • ‘Display-light’ conversions, however, attempt to preserve the ‘look’ of an image through the conversion process, by basing their calculations on the light reproduced by a reference HDR or SDR display.

Although not necessary for the Coronation coverage, we now have HDR slow-motion cameras and 10-bit slow-motion replay servers, which was not the case in 2019 when slow-motion replays were restricted to 8-bit SDR. So, the need to operate and colour match a mixture of HDR and SDR cameras has greatly reduced. Consequently, we seldom need to consider ‘scene-light’ conversions now and almost all format conversions are based on display-light. The one exception is where a specialist camera is required (e.g. a miniature camera in a goal post) which might only be available in SDR. If that’s the case, a scene-light SDR to HDR conversion will still be needed.

HLG HDR is now widely supported in non-linear editing systems which meant that for the Coronation, we could run the on-site post-production facility in HLG HDR, further reducing the need for HDR/SDR format conversion.

HDR to SDR down-mapping has greatly improved too, and often uses the extended signal range above SDR nominal peak-white (10-bit code value 940) to convey some of the highlights from the HDR cameras. That in turn has led to a reduction in SDR > HDR > SDR ‘round-trip’ losses, which is important when SDR content items such as graphics and archive are included in an HDR programme. We’ll take a closer look at that in the next section.

Important for our audiences, HDR camera ‘painting’ controls, which allow a vision supervisor to craft the artistic look for an event, have also matured. These are discussed in greater detail later too.

HDR to SDR format conversion

Critical to our use of the ‘single-stream’ HDR production workflow, are ±«Óătv R&D’s HDR to SDR format conversion ‘3D-LUTs’. These are three-dimensional ‘lookup’ tables which map an HDR input RGB triplet, to an SDR output RGB triplet. For 10-bit RGB input signals, you might assume that you would need a lookup table with two to the power thirty entries, but that would, of course, be huge. Typically, the R, G and B signal ranges are divided into 33 or 65 values each, and table entries provided for all combinations of those values. The missing output values are interpolated. So, it’s easy to encode a complex conversion algorithm into the LUT, without needing expensive dedicated hardware.
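
As a rough illustration of the mechanics (not of our conversion algorithm itself), the sketch below applies a toy 33×33×33 identity LUT to an RGB triplet using trilinear interpolation, which is how output values between the stored grid points are filled in. A real conversion LUT would simply hold different, pre-computed output triplets at each grid point.

    import numpy as np

    N = 33                                    # grid points per channel (33 or 65 are typical)
    grid = np.linspace(0.0, 1.0, N)

    # Toy LUT of shape (N, N, N, 3): an identity mapping. A real HDR-to-SDR LUT would
    # store the pre-computed output RGB triplet for each input RGB grid point instead.
    lut = np.stack(np.meshgrid(grid, grid, grid, indexing="ij"), axis=-1)

    def apply_lut(rgb, lut):
        """Look up one normalised RGB triplet with trilinear interpolation."""
        n = lut.shape[0]
        pos = np.clip(rgb, 0.0, 1.0) * (n - 1)      # position in grid units
        i0 = np.floor(pos).astype(int)
        i1 = np.minimum(i0 + 1, n - 1)
        f = pos - i0                                # fractional distance to the next grid point
        out = np.zeros(3)
        # Blend the 8 surrounding grid entries, weighted by proximity on each axis.
        for dr in (0, 1):
            for dg in (0, 1):
                for db in (0, 1):
                    w = ((f[0] if dr else 1 - f[0]) *
                         (f[1] if dg else 1 - f[1]) *
                         (f[2] if db else 1 - f[2]))
                    idx = (i1[0] if dr else i0[0],
                           i1[1] if dg else i0[1],
                           i1[2] if db else i0[2])
                    out += w * lut[idx]
        return out

    print(apply_lut(np.array([0.25, 0.50, 0.75]), lut))   # ~[0.25 0.5 0.75] for the identity LUT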

Unlike some other LUTs, our LUTs use a ‘non-linear’ conversion algorithm ensuring the HDR image and the SDR image, when shown on a reference 100 cd/m2 nominal peak luminance SDR display, look subjectively similar. Even when monitoring just the down-mapped SDR camera signal, vision engineers can be confident that they’re creating spectacular HDR images. It’s important to use a non-linear converter, because the brightness of HDR and SDR images viewed in TV production are different. Table 1 of Report ITU-R specifies a nominal HDR Reference White level (the diffuse white) of 203 cd/m2 for HDR images (75% HLG), whilst Recommendation ITU-R specifies a nominal peak white level (similar to diffuse white) of 100 cd/m2 for SDR production. A non-linear converter can take account of the non-linear response of the human visual system to the different brightness HDR and SDR signals, whereas a simple linear converter cannot. A linear HDR to SDR down-mapper will usually yield subjectively darker mid-tones and shadows to those seen in the HDR image - if you’re crafting the ‘look’ of the down-mapped SDR image, the HDR image may look overly bright and ‘sat-up’ in comparison.

HDR to SDR format conversion has improved enormously in recent years. It is now at the point where the SDR signal derived from an HDR original is significantly better than the SDR signal usually obtained from a conventional SDR camera. Furthermore, when an HD signal is derived from a UHD camera, the quality of the HD is usually better than that obtained from an HD camera. Thus, the HD SDR audience benefits from the UHD HDR production too.

For the Coronation, we required the use of the very latest ±«Óătv R&D conversion LUTs. The HDR to SDR down-mapping LUT used for the critical camera shading (racking) was our display-light down-mapping LUT 9c. The same down-mapping LUT was applied to the UHD production switcher’s (mixer) output in the final ‘Pres’ (Presentation) OB truck at Canada Gate, to create the HD SDR programme.

SDR to HDR format conversion

Today, and for many years to come, it will be important to be able to include SDR content into an HDR programme. The Coronation was no different, as extensive use was made of archive material throughout . In 2019, we used a technique called ‘up-mapping’ to convert SDR content to HDR. The technique places SDR content into an HDR signal container, and gives a slight boost to the SDR ‘highlights’ so that the SDR content more closely resembles the look of native HDR. The technique works well with carefully prepared drama or movie content, but gives variable results with live cameras and older archive material which can have heavily ‘clipped’ highlights. Those clipped SDR highlights can lead to large, ugly and overly bright regions in the HDR version of the content. They were, however, useful conversions as the ‘up-mapping’ tone-curve could be designed to complement the ‘down-mapping’ tone-curve in the final HDR to SDR conversion on the programme output, thereby reducing the end-to-end ‘round-trip’ losses.

However, for Fox’s 2020 coverage of the US Super Bowl, a different approach was used. Until then, it was common practice for HDR to SDR converters to compress the entire HDR signal range into the SDR signals’ nominal range (10-bit code value 64 to 940). But for the Super Bowl we were asked to produce a down-mapping LUT that placed the compressed HDR highlights into the SDR signal’s headroom, above nominal peak white (10-bit code value 940), in a signal range known as the ‘super-whites’. By doing so, we were able to increase the signal value that the HDR Reference White level (75% HLG) maps to in the SDR signal output, from 86% to 95% SDR. We were a little nervous doing this, as we had previously seen problems with equipment when using the signal range below black (below 10-bit code value 64). However, extensive testing, which we repeated just ahead of the Coronation, showed no issues with either our traditional broadcast or iPlayer playout and distribution chains. For good measure, we actually restrict the SDR signal range to the EBU ‘preferred’ signal range of -5%/+105%, rather than exploit the entire 10-bit signal range.
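
To make the 10-bit numbers above concrete, here is a small worked example using the standard narrow-range mapping, where 0% sits at code 64 and 100% at code 940 (so each percentage point spans 8.76 codes). The levels printed are the ones discussed in this section; the arithmetic is purely illustrative and not a description of any particular converter.

    def narrow_range_code(percent, bits=10):
        """Convert a narrow-range video level (in per cent) to an integer code value:
        0% maps to code 64 and 100% to code 940 in 10-bit, scaled for other bit depths."""
        scale = 1 << (bits - 10)
        return round((64 + 876 * percent / 100.0) * scale)

    for label, level in [("SDR black (0%)", 0),
                         ("HDR Reference White mapped to 95% SDR", 95),
                         ("SDR nominal peak white (100%)", 100),
                         ("EBU preferred upper limit (+105%)", 105),
                         ("EBU preferred lower limit (-5%)", -5)]:
        print(f"{label:40s} -> code {narrow_range_code(level)}")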

Not only does this new type of conversion give brighter SDR images with less heavily compressed highlights, but it allows SDR content to be included into an HDR programme using ‘direct-mapping’ (i.e. without applying the risky highlight boost) where the 100% SDR signal level is directly mapped to the HDR Reference White level of 75% HLG.

The figure below shows how we’re able to achieve a ‘round-trip’ SDR highlight loss of just 5%, using the ±«Óătv’s display-light SDR direct-mapping LUT 3 and down-mapping LUT 9.

[Figure: achieving a 5% round-trip SDR highlight loss using display-light direct-mapping LUT 3 and down-mapping LUT 9]

Camera painting

Different programme genres and geographical regions have their own preferred ‘looks’. By that I mean some genres, such as football, require a ‘punchy’ colourful image with a great deal of contrast, while others like drama may require a less punchy, more desaturated image. Because of those differences, during the ITU-R standardisation process for HLG, it was difficult to agree the default ‘reference’ look. In the end, what is written into the ITU-R BT.2100 standard is a natural look that’s close to nature in terms of its representation of tones and colour. Some professional cameras even refer to it as the HLG ‘natural’ look. It is not as colourful as the SDR production standard BT.709 nor is it as ‘punchy’, but it does provide a neutral image that can then be ‘painted’ to deliver alternative looks. As the name suggests, ‘painting’ controls within a camera allow the vision supervisors to paint the images to deliver the desired aesthetic.

For such a historic production, it was important that we gave the vision supervisors all of the tools they needed to craft the very best images. We understood from previous trials that, for this type of production, the vision supervisors may not be entirely happy with the default ‘natural’ look that HLG provides. So, we specified that all cameras used for the production had to offer a full set of ‘painting’ controls. Typically, these controls allow adjustment of the brightness of shadow detail, mid-tones and colour saturation. The photo app on your mobile phone probably offers similar adjustments. The adjustments were used to great effect to adapt to the varied outdoor lighting conditions on the day of the Coronation, and to ensure that the deep, rich colours of the King’s robes were faithfully captured within the abbey.

As is often the case, two radio cameras were added late in the day, and they could not support the full set of camera painting controls. We’ve been working within the EBU to create a custom HDR ‘user gamma’ LUT that gets close to the traditional BT.709 look often preferred in Europe. These can often be loaded into cameras that don’t yet have a full set of painting controls, to achieve a better match to painted cameras than the default BT.2100 HLG look. It’s been included in EBU Members’ supplements to the EBU’s ‘Baseline HDR Camera Painting Controls’ specification, .

Coronation signal routing

For such a huge TV production, it was important to keep signal routing as simple as possible. Each OB truck had its own director and provided a finished UHD HDR programme for their section of the parade or abbey coverage. The Pres truck then mixed between those feeds and the ±«Óătv’s studio cameras at Canada Gate, to produce the final UHD programme output. As described above, this was then down-mapped in the Pres truck from HDR to SDR using our LUT9c, and converted from UHD 2160p to HD 1080i, to provide the HD SDR programme output.

As a confidence check, the 1080i HD SDR programme output created by the Pres truck was fed back to the other OB trucks to allow them to compare against their locally generated HD SDR signal. This also allowed the vision supervisors in each truck to ensure their pictures were consistent with those of the other OB trucks. The whole set-up worked incredibly well, with the 1080i HD SDR ‘return’ programme feed allowing any format converter configuration errors to be quickly spotted.

A word on HDR camera shading in SDR...

Some in the industry, particularly in North America, have started using linear down-mappers with this workflow, and increasing the peak luminance of the critical SDR shading monitor from the standard 100 cd/m2 to 203 cd/m2. By doing so the SDR monitoring better matches the brightness of any HDR monitoring. The linear scaling in the down-mapper from the HDR Reference White of 203 cd/m2 to the usual SDR white level of 100 cd/m2 is complemented by a linear scaling in the display from 100 cd/m2 to 203 cd/m2. So, the appearance of mid-tones and shadows on the 203 cd/m2 SDR display is very similar to that in the HDR original image, without the need for the more complex non-linear down-mapping technique. It is argued that modern consumer TVs show SDR images at a peak luminance that’s closer to 203 cd/m2 than the 100 cd/m2 specified in Recommendation ITU-R BT.2035, so we should do the same in programme production. Two vision supervisors working on the Coronation asked me what the ±«Óătv thought of the practice.

At first glance increasing the luminance of the critical SDR shading monitor might seem like a sensible move, but we and many in our industry have serious concerns. For a given signal, if you adjust the contrast on a monitor to deliver 203 cd/m2 rather than the usual 100 cd/m2 you will see more detail in the shadows and, because of the non-linear response of the human visual system, the appearance of mid-tones will change too. As a result, a vision engineer will make different artistic adjustments to a camera, depending on whether they are viewing on a standard 100 cd/m2 or 203 cd/m2 SDR display.

All in the industry recognise that few (if any) consumer TVs produce a picture ‘out of the box’ that’s close to the image seen in the TV production environment. But we appreciate that TV manufacturers have put a great deal of effort into optimising their different picture modes to deliver images that are liked by the consumer and work well in the home. Those optimisations were, of course, performed assuming a standard SDR signal crafted on a 100 cd/m2 BT.2035 display. If content producers now change the nature of the signal they distribute by adopting 203 cd/m2 monitoring instead of 100 cd/m2 monitoring, it’s hard to predict how those TV picture modes will perform and what will be seen in viewers’ homes. Moreover, if the practice were to become widespread, as viewers ‘channel-hop’ between traditional 100 cd/m2 broadcasters and 203 cd/m2 broadcasters, the appearance of the SDR images would change and they may find themselves wanting to make picture adjustments on a channel-by-channel basis.

For HLG, we put a great deal of effort into ensuring that the display EOTF (electro-optical transfer function) could deliver subjectively similar images on HDR displays of different peak luminance. But that’s not the case with the older SDR BT.1886 EOTF used in professional SDR monitors. It is for that reason that the ITU-R specify a fixed nominal peak luminance of 100 cd/m2 in Recommendation BT.2035 for critical SDR monitoring in production.

The use of 100 cd/m2 might seem a little old-fashioned these days, particularly as the practice dates back to the use of cathode ray tube (CRT) displays which would defocus at higher brightness. But in international programme exchange, signal consistency is paramount. For this reason, we at the ±«Óătv firmly believe that critical SDR production monitoring should remain at 100 cd/m2 to ensure consistent signals from different programme providers.

The big day, Saturday, 6 May

Part of me wishes that I could say I was at Canada Gate at the heart of the ±«Óătv’s operation on the day of the Coronation; just as I had been for the wedding of the Duke and Duchess of Sussex, our first ever live UHD HDR production in May 2018. But the truth is, thanks to all of the preparation, our work was done after we saw the UHD HDR Coronation preview programme go to air on ±«Óătv iPlayer and ±«Óătv One HD at 7 pm the day before. The UHD HDR workflow, technical approach and equipment configuration were proven, and any issues that arose the next day would be operational and unlikely to be ones that I, as an R&D engineer, could assist with. As space within OB trucks is at a premium, and the workflow now almost BAU (business as usual), I left my colleague Simon Thompson to provide cover on the day itself. I was actually very happy to be able to enjoy the spectacular UHD HDR images via ±«Óătv iPlayer, with friends and family at home.

]]>
±«Óătv Research & Development - 2023 Highlights! /rd/blog/2023-12-bbc-research-development-2023-highlights 2024-01-23T13:13:36Z 2024-01-23T13:13:36Z ±«Óătv Research and Development It's often traditional at the end of the year to look back at the last 12 months - and we are no exception!

See some of our big achievements over the last year in this round up - and as we're always working on the development of media and technology, this digest will give you a head start on where things are headed over the next year or two...


Live Next Generation Audio trial at Eurovision

The  was . So it felt appropriate to continue this tradition by using the event to experiment with the latest audio delivery technology. Having it hosted in the UK also made it an attractive event to perform some internal technical trials. We were able to trial the option of choosing between the ±«Óătv One and Radio 2 commentaries using the user interface on the TV or a smartphone, demonstrating how the content can adapt to different devices.

We've been working with the wider industry to improve audio experiences for our listeners through interaction and personalisation of the audio presentation. For example, you may wish to listen to the narration in a different language; or reduce the level of background music. The experience can also be improved by using immersive audio which is a step-up from surround sound. The audio adapts to the available speakers on the output device and can envelop the listener in sound from all three dimensions. We also announced  which lets audio producers and engineers create immersive and personalised Next-Generation Audio content and experiences.

5G at the King’s Coronation

Over recent years, news crews have increasingly relied on mobile networks to get pictures from the heart of the action; they offer a great way to get to places that you just can’t reach with a satellite truck or cable. This means that there is computer hardware and kit available to broadcast from anywhere you can get a mobile signal. While this is OK most of the time, at big events the large mobile networks can get saturated with data very quickly as everyone tries to upload content to social media and journalists compete to send their pictures back to news channels.

±«Óătv News approached ±«Óătv Research & Development following our successful trial of 5G Non-Public Networks (NPN) at the Commonwealth Games last year and asked if we could help solve this issue. The challenge was a big one - could we provide a private 5G network that was available for the days leading up to the event and during the Coronation itself? We wanted high uplink capacity over a large area which we could offer to news broadcasters from around the world. It led to what was the largest temporary private 5G network of its kind ever deployed.

The trial  - the broadcast technology industry's showcase for innovation which "[pushes] the boundaries of live and linear content creation and delivery."

UHD HDR


±«Óătv R&D ensured that the ±«Óătv iPlayer UHD service provided the very best pictures of the Coronation of King Charles III. Beginning work with ±«Óătv Events late last year to provide UHD HDR production expertise for the live Coronation programmes, we worked with the ±«Óătv Studios Production team, manufacturers, and our Outside Broadcast providers to ensure that the coverage of this historic event was of the highest technical quality.

Engineering, Science & Technology Emmy® Award for Hybrid Log-Gamma (HLG)

Statue © ATAS/NATAS

±«Óătv R&D's Andrew Cotton was named as the ±«Óătv representative, amongst the four key developers, in the  for Recommendation .

Human values

We want to understand people's complex and nuanced needs, helped by our ‘Human Values and Digital Wellbeing’ research projects, and through our experimental approach to qualitative research. As we realise the limits of traditional user research methods, we have been exploring new ways to engage with participants, encouraging them to share their most authentic perspectives through deliberation and discussion. We see a shift in the types of knowledge we have been collecting, away from tactical observations to more strategic patterns in behaviour.

We have used this to explore  as well as attitudes towards the future of travel, plus views on its societal impact.

Responsible AI development

The media industry is not alone in facing challenges to innovating responsibly with artificial intelligence. Working out how to tackle the ethical and practical questions is the role of a new research programme - BRAID, or ‘Bridging Responsible AI Divides’ - which brings insights from the arts and humanities to bear on today’s rapid technical development. As a core partner, the ±«Óătv hosted the BRAID launch event, bringing together a diverse community of policymakers, artists, academics and industry representatives. So what did we learn?

Responsible innovation

Data and technology affect all aspects of the ±«Óătv, from our editorial operations and outputs to how audiences can discover and access our services and content. The ±«Óătv needs to be informed early about the editorial, ethical and social implications and impacts of uses of data and emerging technologies in order to innovate confidently and use these technologies in the public interest. We work with experts across the ±«Óătv, academia and industry to deliver timely research to help the ±«Óătv identify, understand and respond to these challenges.

Flexible media


Imagine a world where the ±«Óătv makes content that’s perfect - just for you.

Content that’s tailored for your circumstances, preferences and devices. Programmes that understand your viewing habits, and flex to fit. Experiences that reflect the things you love, and offer extra information just when you might need it.

Speech-to-text


Improvements in machine learning have allowed us to train our own speech-to-text system. It’s found a myriad of uses across the ±«Óătv, from searching the archive to improving social media shareability.

Media provenance

AI-generated media affects our confidence in the authenticity of the media we now consume online - we now need to be able to answer the question: "Is this genuine?". Answering this means knowing where an image came from and what has happened to it after the picture was created. We are working with partners from across industries to produce an  and the latest version includes tools to help identify the origin of AI generated images.

Adaptive podcasting

What if you could make a podcast that knew a little bit about your user or their surroundings - what are their interests, are they listening at home or on the go? How might the date or time of day change the way the podcast sounds - what are the light levels of the room, is it a sunny or rainy day outside? What if the story could lengthen or shorten depending on how much time the user has to listen - are they on a long walk or a quick trip to the shops? Providing more personal audio experiences could help the ±«Óătv to better meet audience needs, but giving each listener programmes and content that feel as if they have been made specifically for them is not without its challenges.

We have open sourced our Adaptive Podcasting code base and an accompanying learnings document as we've been exploring the concept of individualised and contextualised audio experiences.

±«Óătv News Labs

±«Óătv News Labs is an innovation incubator charged with driving innovation for ±«Óătv News. We explore how new tools and formats affect how news is found and reported, and share insights into our new ways of working.

Future technology trends

Late in 2022, we began to compile a list of technologies that we should be paying attention to and make some recommendations about their adoption to the wider ±«Óătv. We interviewed twenty-two people from the fields of science, economics, education, technology, design, business leadership, research, activism, journalism, and many points between. We spoke to people from both inside and outside the ±«Óătv and around the world. All of these people have a unique view on the future, and our report teases out the common themes from the interviews and compiles their ideas about how things might come to be in the near future.

Low carbon graphics

±«Óătv Research & Development's Blue Room monitors consumer technologies' impact on the ±«Óătv and its audiences. This includes evaluating modern televisions and their features, including energy consumption.

Asking an initial question, 'how much energy do televisions use?' led us on a journey to develop and implement a new idea that we called 'Lower Carbon Graphics' (LCGfx), which we believe has already saved energy in homes across the UK. Many modern televisions include energy-saving features as standard. We wanted to see if ±«Óătv content could take advantage of these characteristics, and reduce energy consumption.

Innovation Labs

We are evolving a programme of Innovation Labs in a model we call the 'Labs Framework' as a way of bringing people, ideas, and technology together. We believe the new model will deliver significant benefits to our audiences and help us adapt our world-beating services for a changing and competitive world when audiences are fragmented and budgets are constrained.

Streaming latency

Viewers watching live television over the internet today are typically seeing the action with a delay of 30 seconds or more. In future, it won’t have to be that way. We’re working on reducing the ‘latency’ of internet streaming to match that of broadcast by streamlining the encoding and distribution chain and using new techniques enabled by the MPEG DASH and CMAF standards. We’re putting together a low-latency end-to-end system for prototyping new approaches to low latency and to test their performance. We’re also working to understand the network conditions that viewers experience and to model how low-latency streaming will perform under those conditions.

This year we made some enhancements to , including adding live versions.

Public service internet

Over the past twenty years television and radio have been complemented by the internet, which has become a vital communications platform for the delivery of , not least because it can be used for much more than delivering radio and television programmes to connected devices, important though that is. The two-way nature of the network creates a very different space from the one-way broadcasting model, and dynamic publishing tools like the World Wide Web allow the internet to host material that engages audiences in new ways.

Today, the scope for using the internet for public benefit rather than to serve purely commercial or government interests has grown. We are thinking about what this might mean and how the ±«Óătv could help create an internet that more easily supports the online ambitions of public service organisations of all types, around the world.

Artificial intelligence for production

How could computer vision and machine learning be used to assist in the production of television? One area in which we have experimented with these ideas is on the Watches (Springwatch, Autumnwatch and Winterwatch) series of programmes produced by . Here, we have helped to monitor the video and audio feeds coming from the many cameras that the production team have placed out in the wild.

This year we created Wing Watch - taking some of the data we provide to the production team and putting it in front of the audience, enhancing their experience of the wildlife camera feeds by adding data and highlights to the stream.

Improving picture quality using machine learning

The ±«Óătv has always pushed boundaries to achieve better video quality for both streaming and broadcasting - one example is the ±«Óătv’s contribution to the Ultra High Definition (UHD) standard. Many TVs now display broadcasts at 100Hz or more. Generally, broadcast content is recorded at a lower frame rate. Frame interpolation algorithms are deployed in new TVs to ensure that such content is played at the required frame rate.

One problem with this is that interpolated frames produce a lot of motion blur on some TVs and this detracts from the programme. Traditionally, the interpolated frame is generated by computing the motion in-between frames and using this information to warp the input frames. This approach has worked very well in the past but handling large motion and changes in brightness and occlusions (where one pixel appears in a frame but not the other) is problematic. Artificial intelligence interpolation algorithms allow this problem to be mitigated and challenging sequences can be handled well by the proposed model we have developed.
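
As a simple illustration of the traditional motion-compensated approach described above (and emphatically not of the model we have developed), the sketch below estimates dense optical flow between two frames with OpenCV and backward-warps each frame to the temporal midpoint before blending. This is exactly the style of method that struggles with large motion, brightness changes and occlusions.

    import cv2
    import numpy as np

    def interpolate_midpoint(frame_a, frame_b):
        """Synthesise the frame halfway between two frames by estimating optical flow
        and backward-warping each frame to the temporal midpoint, then blending."""
        grey_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
        grey_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

        # Dense flow in both directions (Farneback: a classic, non-learned estimator).
        flow_ab = cv2.calcOpticalFlowFarneback(grey_a, grey_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        flow_ba = cv2.calcOpticalFlowFarneback(grey_b, grey_a, None, 0.5, 3, 15, 3, 5, 1.2, 0)

        h, w = grey_a.shape
        xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))

        # For a midpoint pixel, sample each source frame half way along the flow that
        # points back towards it (approximated by half of the opposite-direction flow).
        mid_from_a = cv2.remap(frame_a, xs + 0.5 * flow_ba[..., 0], ys + 0.5 * flow_ba[..., 1],
                               cv2.INTER_LINEAR)
        mid_from_b = cv2.remap(frame_b, xs + 0.5 * flow_ab[..., 0], ys + 0.5 * flow_ab[..., 1],
                               cv2.INTER_LINEAR)

        # A naive blend: occlusions and brightness changes show up as ghosting here,
        # which is precisely where learned interpolation models do better.
        return cv2.addWeighted(mid_from_a, 0.5, mid_from_b, 0.5, 0)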

Viewing recommendations

Modern audiences that are used to streaming platforms expect content to be high quality, but also relevant to their personal taste. Editorial teams juggle a variety of objectives in creating recommendations, for example ensuring that ±«Óătv iPlayer content is diverse and aligns with ±«Óătv values, while promoting specific shows to specific audiences. A lot of thought goes into this curation, but it can be difficult to anticipate what audiences end up seeing because personalised recommendation systems, at least in part, reorganise the content. Now a tool developed in ±«Óătv R&D can visualise the trade-offs that recommender systems make, where improving recommendations for one group of users might reduce performance for another group.

]]>
Adding languages, live and low-latency: our new and updated streaming media test feeds /rd/blog/2023-10-streaming-media-testing-dash-hls 2023-11-08T00:18:43Z 2023-11-08T00:18:43Z Stephen Perrott The ±«Óătv Research & Development Distribution team provides . Test content like ours is extremely important in testing HTTP based adaptive bitrate systems such as MPEG DASH and HLS where the client controls the presentation of the stream based on a 'manifest' retrieved from the content provider's server. Incorrect interpretation of the manifest can lead to viewers getting the wrong thing (for example audio description when they didn’t want it), and erroneous requests can create problems for serving infrastructure or even other clients.

Below are details of some recent enhancements that we’ve made to , including adding live versions.

We now have these variants:

  • A one-hour VOD stream
  • A standard live stream mimicking a continuous simulcast channel
  • A low-latency live stream
  • A set of webcasts – live streams that have a beginning and an end

A screenshot of the test card that can be viewed on the stream.

CMAF

The media has been updated to be fully conformant to  – the Common Media Application Format published by MPEG (the Moving Picture Experts Group – an ISO and IEC joint committee). We were closely involved with the development of the CMAF specification and are keen to promote the interoperability it offers. CMAF allows media to be interoperable between  (Dynamic Adaptive Streaming over HTTP) and  (HTTP Live Streaming) streaming protocols. The ±«Óătv uses MPEG DASH, and specifically the , for the majority of our streaming clients, and uses HLS to stream to Apple devices. With this change, we have included both MPEG DASH and HLS manifests, referencing the very same media segments.

More about the live streams

The  are designed to test clients’ abilities to make requests within the advertised segment availability times – early or late requests are clearly identified in the media segments served. Incorrectly operating clients can increase server load and pollute caches with 404 (not found) responses, which could further impact other clients. Most existing test content doesn’t check for this type of client-server interaction, meaning problems can only be detected by analysis of server logs or intercepting network traffic.

An in-vision warning showing that a segment has been requested late.

Segments requested at the wrong time have an in-vision (or sound) notice.

In-vision and in-audio times (in GMT) allow the user to see the latency being achieved through the delivery chain by comparing these to wall clock time. The streams have availability times such that media becomes available as if it had no encoding delay (but the server still assembles complete segments before making them available).
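
As a rough sketch of the arithmetic involved, assuming a $Number$-based SegmentTemplate and ignoring refinements such as Period start offsets and availabilityTimeOffset, a numbered segment becomes available once the whole of it could have been produced. The values below are illustrative and are not those used by our streams.

    from datetime import datetime, timedelta, timezone

    # Illustrative values in the style of a live DASH MPD with a $Number$ SegmentTemplate.
    availability_start_time = datetime(2023, 10, 1, 12, 0, 0, tzinfo=timezone.utc)  # MPD@availabilityStartTime
    segment_duration = 3.84   # SegmentTemplate@duration / @timescale, in seconds
    start_number = 1          # SegmentTemplate@startNumber

    def segment_availability_start(number):
        """A numbered segment becomes available once the whole segment exists, i.e. at
        availabilityStartTime plus the number of complete segments times the duration."""
        complete_segments = number - start_number + 1
        return availability_start_time + timedelta(seconds=complete_segments * segment_duration)

    now = datetime.now(timezone.utc)
    n = 1234
    available_at = segment_availability_start(n)
    if now < available_at:
        print(f"Segment {n} requested {(available_at - now).total_seconds():.2f}s too early")
    else:
        print(f"Segment {n} has been available for {(now - available_at).total_seconds():.2f}s")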

Low latency

We continue to research low latency streaming, both working with standards bodies and creating prototypes. Low latency techniques allow streaming to achieve a latency which is similar to broadcast, with latency meaning the time between something happening and the audience seeing or hearing it.

We have now created a new  of our live test card stream. This uses the same long GOP (Group of Pictures) encoding, but with each segment formed of 4 CMAF chunks. The MPD (DASH Media Presentation Description) indicates that the client can request segments early, and if it does this the server will send each chunk as it becomes available. Within each MPD is a Service Description element which sets the target latency for the client, as well as bounds on playback rate when catching up to that target. We have set the target to a value which we think is achievable under typical network conditions by clients optimised for low latency playback. When combined with an encoding delay, this would allow DASH streaming to achieve parity with broadcast latencies. We serve the streams through a CDN, mirroring the distribution we use on our main services to deliver to millions of simultaneous viewers. As with the other streams, in media time announcements allow latency measurement against wall-clock time as well as confirming synchronisation between different components.
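
The sketch below shows the general shape of that Service Description signalling and how a client might read it. The element and attribute names follow the DASH specification (Latency values are in milliseconds), but the numbers are invented for illustration and are not the targets set on our streams.

    import xml.etree.ElementTree as ET

    # Illustrative MPD fragment: element and attribute names follow the DASH spec,
    # but the numbers are invented for this example (Latency values in milliseconds).
    MPD = """<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic">
      <ServiceDescription id="0">
        <Latency target="3500" min="2000" max="6000"/>
        <PlaybackRate min="0.96" max="1.04"/>
      </ServiceDescription>
    </MPD>"""

    ns = {"dash": "urn:mpeg:dash:schema:mpd:2011"}
    root = ET.fromstring(MPD)
    latency = root.find("dash:ServiceDescription/dash:Latency", ns)
    rate = root.find("dash:ServiceDescription/dash:PlaybackRate", ns)

    target_s = int(latency.get("target")) / 1000.0
    print(f"Aim to play {target_s:.1f}s behind live, adjusting playback rate between "
          f"{rate.get('min')}x and {rate.get('max')}x to hold that target.")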

Webcast streams

In addition to our live streams are a set of '' or '' . We have generally found that there isn’t much content around which can be used to verify clients behave correctly with live streams that have a start and end. This improves the audience experience by ensuring a clean start and end to playback, as well as avoiding multiple re-requests for segments that don’t exist. Our new streams are advertised 30 minutes in advance with a manifest showing an availabilityStartTime in the future. This manifest initially describes a continuous stream, without a finite duration, but as the stream approaches the end, the manifest updates to include a duration and inband events are inserted into the media, signalling a manifest update. Properly implemented clients follow this signalling and stop at the signalled point. Media requested after the end of the stream will be served, but will clearly indicate in the presented media that it shouldn’t have been requested or played. There is one webcast starting every 5 minutes and they last for just over 12 minutes.

Languages and access services

Finally, we created new representations with audio and subtitles in different languages as well as those containing audio description. We have added all of these to some manifests - giving a full list of options to aid the development and testing of clients with track switching capabilities. Other manifests have a single language for clients without those abilities.

We believe these updated test streams provide both a useful testing resource for the industry to improve interoperability, and a demonstration of the capabilities of the streaming protocols.

]]>
5 things we learnt about the barriers we face and bridges we must build to ensure Responsible AI /rd/blog/2023-10-responsible-ai-trust-policy-ethics 2023-11-08T00:31:39Z 2023-11-08T00:31:39Z Rhianne Jones Bronwyn Jones The media industry is not alone in facing challenges to innovating responsibly with artificial intelligence (AI). Working out how to tackle these vital ethical and practical questions is the role of a new research programme - BRAID, or ‘Bridging Responsible AI Divides’ - which brings insights from the arts and humanities to bear on today’s rapid technical development. As a core partner, the ±«Óătv hosted the BRAID launch event, bringing together a diverse community of policymakers, artists, academics and industry representatives. So what did we learn?

A Responsible AI (RAI) ecosystem can only be achieved if we all work together

Professor Shannon Vallor described how many corporate leaders are seeking to redefine RAI as a narrowly technical challenge of AI safety, but what's missing is the human question of what we collectively deserve and want from AI, including justice, accountability, equity and liberty to know, create and flourish. We need a richer understanding of the kinds of societies people want with AI - and that's where the arts and humanities can help us go beyond merely limiting harms to co-construct humane visions, knowledge and practices for AI.

Co-Directors Shannon Vallor and Ewa Luger on stage introducing the event in front of a screen displaying a 'Welcome to BRAID' message.

Co-Directors Shannon Vallor and Ewa Luger took a critical look at RAI progress to date

We're at a familiar inflection point, Professor Ewa Luger said, where we have proliferating sets of high-level ethical principles but they are not stopping predictable harms and impacts on people's livelihoods - generative AI is a case in point. We see actors in AI innovation being 'just responsible enough' to keep society pacified and at the same time, the 'human and the humane are being stripped out' so technical developments can progress. Supporting new voices from the next generation of thinkers underpins BRAID's approach to supporting the future leaders of RAI.

We've been here before!

Dr Rumman Chowdhury took us through a potted history of RAI and how we've seen the same narratives (Is AI alive? Will AI take our jobs? etc.) recur with generative AI, seeming to pause the progress that had been made in industry practice and governance, government investment and policy development, and resourcing of civil society and independent third parties. She said the toys have changed, but the rules of the game have not - power still remains concentrated in the hands of the few.

Dr Rumman Chowdhury on stage in blue light, stood at a lectern in front of a screen reading: Building Responsible AI in the age of Generative AI.

Rumman Chowdhury said we don't need to reinvent the wheel for RAI - we can expand existing and ongoing work:

"What people want is to be able to take advantage of the technology. What is blocking them is not the technology, it's the broken institutions. So the fear every photographer, screenwriter, actor, etc., has is not that artificial intelligence will take their jobs. It's that corporations will put them out of a job by using this technology. Even though there are pathways forward that would actually enable and ensure everybody to be gainfully employed - for all of us to be benefiting from the technology."

Responsible AI requires more than technical safety

Panellists discussed how building good societies where AI works for everyone requires deliberately and effectively bringing in the voices of those currently excluded from the conversation, including diverse publics and underrepresented groups. It involves normative questions about if, when and how we should use AI and importantly, we need to be able to say no to deploying these technologies when necessary. Finally, RAI is about practice and processes that need to be cultivated and iterated over time.

Panellists sat in a row in front of a screen reading: Panel 1: Lessons from the first wave of Responsible AI. Panellists, from left to right: Andrew Strait, Ada Lovelace Institute, Ali Shah, Accenture, Helen Kennedy, University of Sheffield, David Leslie, Alan Turing Institute, Dawn Bloxwich, DeepMind, Stephen Cave, University of Cambridge.

Left to right: Andrew Strait, Ada Lovelace Institute chaired Panel 1 with Ali Shah, Accenture, Helen Kennedy, University of Sheffield, David Leslie, Alan Turing Institute, Dawn Bloxwich, DeepMind, Stephen Cave, University of Cambridge

Existential risk framings are a distraction from the here and now

AI systems are already impacting people's lives in important ways and diverting the conversation to future potential harms can distract from the need to intervene in AI innovation now. ChatGPT and the recent wave of generative AI have brought critical questions of power into public discourse and this can represent an opportunity to refocus the debate away from existential risk, instead amplifying new voices that work to evidence and explore the real impacts.

Panellists on a purple stage talking. From left to right: Atoosa Kasirzadeh, Abeba Birhane, Mozilla Foundation,  Yasmine Boudiaf, Ada Lovelace Institute, Arne Hintz, Cardiff Data Justice Lab, Carolyn Ten Holter, Oxford Internet Institute, Jack Stilgoe, UCL.

Left to right: Panel 2 was chaired by Atoosa Kasirzadeh and included Abeba Birhane, Mozilla Foundation, Yasmine Boudiaf, Independent Artist, Arne Hintz, Cardiff Data Justice Lab, Carolyn Ten Holter, Oxford Internet Institute, Jack Stilgoe, UCL

Our futures are not determined: Power matters but so do moral and creative imaginaries

Panellists discussed how creatives are both inspired and experimenting with AI capabilities, and simultaneously feeling anxiety and fear about their jobs and professional futures. Looking back for historical lessons (e.g. interrogate the material infrastructures, avoid extremes and make room for nuance and detail) and leveraging the experience and expertise of diverse stakeholders in the present can help us build new desirable futures.

Panellists sat in a row in front of a screen reading: Panel 3 - The Future of Responsible AI with the Arts and Humanities. From left to right the panellists are: Rhianne Jones, ±«Óătv R&D, chaired Panel 3 with Jonnie Penn, Cambridge, Rebecca Fiebrink, Goldsmiths, Ramon Amaro, Nieuwe Instituut, Joel McKim, Birkbeck and Franziska Schroeder, Queen’s University Belfast

Left to right: Rhianne Jones, ±«Óătv R&D, chaired Panel 3 with Jonnie Penn, Cambridge, Rebecca Fiebrink, Goldsmiths, Ramon Amaro, Nieuwe Instituut, Joel McKim, Birkbeck and Franziska Schroeder, Queen's University Belfast

Alongside the talks, guests enjoyed an immersive curated exhibition of art, music and interactive visualisations in the ±«Óătv Media Cafe featuring work by Patricia Wu Wu, Emma Varley, Wesley Goatley, Jake Elwes, Pip Thornton and a musical performance from Jess +.

Composite image of four photos from a musical performance by Jess Fisher, Deirdre Bencsik, Clare Bhabra and a robot, controlled by Jess.

Image of a musical performance by Jess Fisher, Deirdre Bencsik, Clare Bhabra, robot

The complexity and urgency of responsible AI requires these important conversations to continue. For more information on how the BRAID programme will continue to do this, .

Interested in getting involved? See the . Have an idea for a project that responds to one of ? Get in contact with ±«Óătv via BRAID or by emailing responsibleinnovationteam@bbc.co.uk

BRAID is dedicated to integrating Arts, Humanities and Social Science research more fully into the Responsible AI ecosystem. Funded by the UKRI Arts and Humanities Research Council, BRAID is the first Responsible AI programme of its scale in the UK. BRAID's ambition is to see a wider community of researchers, practitioners and publics collaborate with industry and policymakers to tackle some of the biggest ethical questions posed by AI, building public trust and ensuring the UK remains at the global forefront of the research, development and deployment of AI. To find out more about BRAID, you can read a , or visit our website at

A big thanks to the ±«Óătv-BRAID event launch team led from the ±«Óătv side by Thom Hetherington and to the ±«Óătv Radio Theatre production team in London.

]]>
Increasing trust in content: Media provenance and Project Origin /rd/blog/2023-10-media-provenance-watermarks-fingerprints-deepfake 2023-10-24T10:21:30Z 2023-10-24T10:21:30Z Laura Ellis The environment in which our news exists has changed. What used to be secure, immutable, and simple is now open and chaotic. Imagine, if you will, a plate of spaghetti. It might look complex, but each strand of spaghetti is a linear entity. You can see where each element starts and ends. Our flow of news used to be like that. Now, think of a bowl of trifle. A mass of different things all coexisting alongside each other – the jelly, the sponge, the fruit, the custard. Is that a piece of fruit? Or a piece of cake? Is it soaked in sherry? Is that real whipped cream or is it from a can? This is the social media confection into which we distribute some of our most precious content. It’s big, it’s flamboyant and it may not always be good for us.

The ±«Óătv, and others with an interest in preserving access to reliable news, began to worry about this a few years ago. We joined Microsoft, CBC Radio-Canada, and the New York Times to form Project Origin - a body that aims to secure trust in news through technology. As the project progressed, we found fellow travellers and together set up the Coalition for Content Provenance and Authenticity (C2PA) to work on a set of open standards which would allow content to contain provenance details. The work began with a range of organisations developing and using media provenance signals to help audiences determine what they choose to trust.

Think of a piece of content you encounter 'in the wild'. It might possess all the branding and properties you'd expect to see from a media organisation, but it could have been falsified. It could have been completely synthesised - even to the extent of a digitally originated human in a fabricated setting. Or it might be a mix of synthetic and 'real' material delivered in the context and with the intent set by whoever has created it. Many news providers have seen attempts to spoof their content, so how can we, as media organisations, protect ourselves and our content, and help our audiences understand what they're seeing?

There are two main ways a piece of content might tell you about itself:

  • It might contain signals that its originator has put into the content - stating its origin and possibly offering other metadata. This is 'declared' provenance.
  • It might contain artefacts that can be detected or 'inferred'; deepfake detection has always been a somewhat inexact science, but techniques do exist.

Broadly, there are three technologies that can support signalling the provenance of media:

Watermarks began life as visible 'stamps' applied to an image to show ownership - think of the fun-run or wedding photos you were sent - or the little rainbow in the bottom left-hand corner of a DALL-E image. They're relatively easy to remove or crop out. More sophisticated watermarks can now be embedded invisibly (to the human eye) into the pixels of an image - Google's new SynthID is an example of this.
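
To make the idea of an invisible watermark concrete, here is a toy sketch (nothing like a production scheme such as SynthID, and the file names are hypothetical) that hides a short bit string in the least significant bits of an image's pixels using Pillow and NumPy, then reads it back:

```python
# Toy invisible watermark: hide a short bit string in the least significant
# bit of each pixel. Illustrative only - a real watermark has to survive
# compression, cropping and re-encoding, which this one will not.
import numpy as np
from PIL import Image

def embed(image_path, bits, out_path):
    pixels = np.array(Image.open(image_path).convert("L"), dtype=np.uint8)
    flat = pixels.flatten()
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | bit          # overwrite the lowest bit
    Image.fromarray(flat.reshape(pixels.shape)).save(out_path)  # lossless PNG

def extract(image_path, n_bits):
    flat = np.array(Image.open(image_path).convert("L"), dtype=np.uint8).flatten()
    return [int(p & 1) for p in flat[:n_bits]]

embed("original.png", [1, 0, 1, 1, 0, 1, 0, 0], "watermarked.png")
print(extract("watermarked.png", 8))              # [1, 0, 1, 1, 0, 1, 0, 0]
```

To the eye the two images are identical, but the payload is destroyed by the first crop or lossy re-save, which is why production watermarks are far more sophisticated.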

Fingerprints represent the nature of a piece of content in an approximate, more searchable form. One example is YouTube's Content ID, which uses fingerprints to match audio to a database of registered content. If you've ever had a message to say “No, you can't put a Taylor Swift backing track on your holiday video”, this is how that has come about. Much of the digital rights management we encounter on the internet uses this kind of technology.
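
As a rough sketch of how fingerprint matching works (using the open-source imagehash library purely for illustration - not whatever Content ID actually runs - and hypothetical file names), a perceptual hash reduces an image to a short signature that changes only slightly under small edits, so copies can be found by comparing signatures:

```python
# Perceptual fingerprinting sketch: reduce each image to a compact hash and
# compare hashes by Hamming distance. Small edits (resizing, light
# compression) move the hash only a little, so near-duplicates still match.
import imagehash
from PIL import Image

registered = imagehash.phash(Image.open("broadcast_still.png"))
candidate = imagehash.phash(Image.open("social_media_copy.jpg"))

distance = registered - candidate     # Hamming distance between the hashes
print(f"Hamming distance: {distance}")
if distance <= 8:                     # threshold picked purely for illustration
    print("Likely a copy of the registered content")
```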

Cryptographically Hashed Metadata is a less flexible but more secure form of fingerprinting, in that it forms a small and unique representation of the underlying data. If the data changes in any way, even by a single bit, the hash will no longer match the data. Protecting the hash with a cryptographic signature is then an effective way of protecting the integrity of the whole data: the signature proves who vouched for the hash, and recomputing the hash confirms the data still matches it. This is the basis of the Content Credentials icon, which has recently been adopted by major brands and industry leaders, including Adobe, Microsoft, Publicis Groupe, Leica, Nikon, Truepic, and many more, bringing a new level of recognisable digital content transparency from creation to consumption.
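
The principle is easier to see in code. The sketch below is a simplified illustration of hashing-plus-signing (it is not the C2PA or Content Credentials format, and the file name and metadata are made up), using Python's hashlib and the cryptography library:

```python
# Cryptographically hashed metadata in miniature: hash the content and its
# metadata, sign the hash, then later re-hash and verify the signature.
# A single changed bit anywhere breaks the check.
import hashlib
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

content = open("news_photo.jpg", "rb").read()
metadata = {"publisher": "±«Óătv News", "captured": "2023-10-24T09:00:00Z"}

digest = hashlib.sha256(content + json.dumps(metadata, sort_keys=True).encode()).digest()
signature = private_key.sign(digest)          # the publisher vouches for the hash

# Later, anyone holding the public key can check that nothing has changed.
check = hashlib.sha256(content + json.dumps(metadata, sort_keys=True).encode()).digest()
try:
    public_key.verify(signature, check)
    print("Provenance intact: content and metadata unchanged")
except InvalidSignature:
    print("Content or metadata has been altered since signing")
```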

The truth is that effective provenance, in this complex ecosystem, may well draw on a combination of these. Practitioners will want to be proactive in distinguishing their content by ensuring audiences can understand what they're seeing, and as Generative AI starts to take hold there's some excellent work on disclosure such as the '' work carried out by the Partnership on AI - work to which the ±«Óătv contributed.

There's a strong appetite amongst news providers to protect their content. And there is evidence that adding provenance has a positive impact on trust in our content online. Work done by ±«Óătv Research & Development and ±«Óătv News to use C2PA signals is now being picked up and developed as a pilot by Origin partner Media City Bergen who, with the , joined the Origin consortium as members this year. The idea is to build a range of options through which organisations can employ provenance signals - directly at the point content is published and using functionality offered by manufacturers. Our in-house research team has found evidence that adding provenance to images increases trust in content amongst those who don't typically consume our content. We also found evidence that provenance evens out trust across a range of images we use (editorial, stock and user generated content).

One of the benefits of using provenance tools will be enhanced transparency. We can not only offer reassurance about where something has come from, but also share details of how it has been edited or put together. , and being able to be even more transparent about where content has come from can go a long way to helping us achieve this.

]]>
Introducing Innovation Labs: bringing the ±«Óătv together to solve problems /rd/blog/2023-10-innovation-labs 2023-12-18T14:36:04Z 2023-12-18T14:36:04Z Eleni Sharp ±«Óătv Research & Development is evolving a programme of Innovation Labs in a model we call the "Labs Framework" as a way of bringing people, ideas, and technology together. We believe that the new model will deliver significant benefits to our audiences and help us adapt our world-beating services for a changing and competitive world in which audiences are fragmented and budgets are constrained.

This post introduces the labs, describes where this model came from, and shares some of our plans for the coming months.

For over a century the ±«Óătv has been an innovator in both technology and content, inventing core elements of radio and television, pioneering online services, and of course creating programme formats that audiences in their millions have enjoyed.

Eight years ago, I led the team that launched ±«Óătv Taster to be a home for new ideas for the ±«Óătv and its partners. We have hosted over 300 projects on Taster, exploring new technologies and formats, such as the ±«Óătv’s first augmented reality project, highly personalised experiences, and iPlayer watch parties that incorporated personal data stores.

My team also launched MakerBox, a toolkit of technology which lets people make their own innovative content experiences. It continues to be a space for experimentation where the ±«Óătv meets our audiences.

However, the world has changed since we launched Taster, and like every other organisation in this increasingly digital world, we have an opportunity to rethink how we do things. I spent time this year researching and speaking to a range of organisations, including other public service broadcasters, to understand how they approach innovation. Their insight helped evolve the Labs Framework.

Speaking to leaders across the ±«Óătv made it clear that, while everyone fully acknowledges the importance of innovation, we face real challenges in identifying, testing, and committing to the right innovations.

My colleague Henry Cooke’s foresight report anticipates a vastly changed world in the next few years, which has implications for the ±«Óătv and its role as a public service media organisation. The technologies he has been tracking are hugely transformative and there is going to be tough competition in this space as very large players with substantial resources compete for audience attention. There has never been a more important time for the ±«Óătv to innovate.

As the outside world gets more complex and less predictable, the ±«Óătv also has to make big choices around product direction, its technology roadmap, and its distribution strategy. So we designed a labs framework that takes account of these complexities, to ensure that work taking place in R&D can directly benefit the wider ±«Óătv, building on our understanding of what the organisation will need in the near and medium term.

We want to make it really easy for ±«Óătv teams to use R&D’s thinking and technology to solve their day-to-day and long-term problems, bringing value to audiences as soon as we can. It was a challenge, but we are very pleased with the reception we have had outside R&D and with some early successes in channelling our innovative thinking into the wider ±«Óătv.

What is a Lab?

A lab is a group of people from across the ±«Óătv - engineers, producers, researchers, designers, journalists, and others - who come together to work on a shared objective for an agreed amount of time, within a collaborative framework with appropriate business management and project support.

It produces design ideas, business cases, landscape reviews, prototypes and evaluations. A lab will run design workshops, audience trials and technology assessments to inform its work, all focused around a carefully developed brief.

The approach builds on the significant successes of ±«Óătv News Labs, the joint R&D and ±«Óătv News project, which has now been running for ten years. Initially, the framework will incorporate a refreshed News Labs alongside two additional labs:

News Labs - With a revised agenda aligning R&D, News and ±«Óătv product teams we will be focusing on live & streaming, authoring & storytelling, and trusted news.

Multilingual Lab - There is a huge opportunity to take advantage of new transcription and translation models. We are working to explore a Multilingual Lab which will meet the needs of our teams in the World Service and beyond.

Sound Lab - We have partnered with ±«Óătv Sounds and are building out a body of work exploring audience experiences, personalised content and new ways to experience audio from the ±«Óătv.

Underpinning the three labs, we have two crucial enablers:

The Sandbox - a portfolio of technical platforms which recreate the live product to allow for experimentation in a risk-free environment. Our first sandboxes include the News content management system, and the Sounds web and Android apps. Our colleagues around the ±«Óătv will be able to take advantage of a safe place to trial cutting edge ideas.

Insights Lab - All the work described above will be supported and validated with a new approach to traditional research: we will be working with audiences for a longer time frame and in more considered ways to really get to know them and their needs.

What’s next?

We have a busy six months ahead of us. Over the next few weeks, we will launch our first Sandboxes and bring in new partners from around the ±«Óătv to use them. News Labs and the Insights Lab will run research sessions in south Wales community centres exploring, amongst other things, what audiences want and need from us at live events, including elections.

And in partnership with the ±«Óătv’s product teams, we’ll extend R&D’s work around the In Car experience, focused on commuter journeys.

We can’t wait to get started and will share our progress with you here and on the .

]]>
Projections: "Things are not normal" /rd/blog/2023-10-projections-things-are-not-normal 2023-11-10T17:01:26Z 2023-11-10T17:01:26Z Henry Cooke Late in 2022, we began a straightforward-sounding research project: compile a list of technologies that we should be paying attention to in ±«Óătv Research & Development over the next few years and make some recommendations about their adoption to the wider ±«Óătv. As I’m sure you’ve already guessed, things didn’t turn out quite so straightforward.

By the end of the project, we'd interviewed twenty-two people from the fields of science, economics, education, technology, design, business leadership, research, activism, journalism, and many points between. We spoke to people from both inside and outside the ±«Óătv and around the world. All of these people have a unique view on the future, and our report teases out the common themes from the interviews and compiles their ideas about how things might come to be in the near future.

We grouped the themes we identified into five sections. The first, A complex world, outlines sources of complexity and uncertainty our interviewees see in their worlds. Climate change is by far the largest and most significant of these. The next section, A divided world, also covers big-picture context and outlines some of the social and economic drivers our interviewees see playing out over the next few years. The AI boom and New interactions go into detail on specific technologies and use cases our interviewees think will be significant. Finally, The case for hope bundles up some of the reasons our interviewees see to be hopeful about the future — provided we are willing to act to bring about the changes we'd like to see in the world.

We set out to get a sense of significant technologies and platforms, but we also gained a sense of the wider social, political, and economic contexts that will be affecting the ±«Óătv and our audiences in the coming years. After all, no-one uses technology in a vacuum.*

We hope you find the report interesting and useful - please let us know if you do!

* apart from quantum engineers, astronauts, and particle physicists.

]]>
At the drop of a frame: Measuring the performance of graphics on TVs /rd/blog/2023-10-measuring-the-performance-of-graphics-on-tvs 2023-10-18T14:05:46Z 2023-10-18T14:05:46Z Ewan Roycroft Our smart TVs and set-top-boxes often have only a fraction of the processing power of laptops and phones. This makes it difficult for application developers to create fluid and responsive user interfaces. The ±«Óătv supports TV applications, like iPlayer and Sounds, on millions of devices, and providing a good quality-of-experience for everyone is a challenge.

We use browser technology to help us target a wide range of device platforms, but this comes at a performance cost when compared with native applications. Recently, some developers have created TV application frameworks using WebGL, and claim that these perform better than applications built with traditional HTML and CSS. That is why we are looking into how we can measure the capabilities of different browsers and application frameworks, to make our own assessment and determine how to provide the best experience on low-powered TVs.

Traditional web applications use the Document Object Model (DOM): the page layout is created using HTML and styled using CSS. This is a sophisticated model, and that sophistication comes at a cost, as the browser must perform many advanced rendering operations on a complex object tree. To ensure that animations are smooth on TVs, developers can optimise their applications by using accelerated CSS properties. Some TV developers argue that we can reduce the workload even more by simplifying the model and using low-level graphics operations via WebGL. They have created application frameworks that implement this concept and claim they out-perform the DOM on low-powered TVs. We set out to verify these claims to determine if a WebGL framework might be the right approach for our applications.

So how can we measure the quality experienced by the viewer? Well, you may have heard the term 'frames per second' (FPS) used to describe the quality of a video or the performance of a computer game. FPS is a simple measure of how many frames are presented to our eyes every second. We can also consider a related measure, the percentage of frames dropped, to assess the performance of computer animations. Assuming an animation needs to move at a certain FPS, we count the number of times the computer failed to render a new frame in time. A poorly optimised animation will result in a high percentage of dropped frames, and we perceive it as jerky.
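
As a quick worked example with made-up numbers: a menu animation targeting 60 FPS over ten seconds should produce 600 new frames, so if only 450 of them arrive, a quarter of the frames were dropped.

```python
# Dropped-frame percentage: the share of expected new frames that never arrived.
target_fps = 60
duration_s = 10
expected_frames = target_fps * duration_s   # 600 new frames expected
rendered_frames = 450                       # hypothetical measurement

dropped_pct = 100 * (expected_frames - rendered_frames) / expected_frames
print(f"{dropped_pct:.1f}% of frames dropped")  # 25.0%
```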

One way to detect when frames may have been dropped is to poll the browser’s requestAnimationFrame() method. However, this is not always as accurate as you might expect, since it cannot provide an insight into the entire rendering pipeline. For this reason, it is not a reliable way to measure real-world rendering performance. Despite this, we have seen measurements made with requestAnimationFrame() used to claim that WebGL frameworks can perform better than the DOM on low-powered hardware. To assess these claims, we took a different approach.

To get a true idea of the quality that our eyes see, we need to look at the physical output of the TV. We have developed a method of doing this by connecting our source to an HDMI-capture device and using it to record what is actually happening on screen. By stepping through the resulting video file frame-by-frame, we can see where there was a difference between frames. If we run a pre-programmed set of animations on the TV, we know when movement should have occurred, so we can determine if and when a frame was dropped, as we will observe two identical frames where there should be a difference. We know how many frames of movement we are expecting, so we can calculate the percentage of frames that were dropped. We used a software tool to automate this process, allowing us to quickly analyse many long animation sequences.
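
Our analysis was automated with a dedicated tool, but the core idea can be sketched in a few lines (an illustrative approximation assuming OpenCV, a capture recorded during continuous motion, and a hypothetical file name): step through the recording, flag consecutive frames that are identical where movement was expected, and report the proportion dropped.

```python
# Sketch of the HDMI-capture analysis: during continuous animation every
# captured frame should differ from the previous one, so an identical pair
# means a frame was dropped. Illustrative only - the real analysis must also
# handle deliberate pauses and capture-device quirks.
import cv2
import numpy as np

capture = cv2.VideoCapture("hdmi_capture.mp4")
ok, previous = capture.read()
expected_changes = 0
dropped = 0

while ok:
    ok, frame = capture.read()
    if not ok:
        break
    expected_changes += 1
    # Identical consecutive frames during constant motion indicate that the
    # device failed to present a new frame in time.
    if not np.any(cv2.absdiff(frame, previous)):
        dropped += 1
    previous = frame

capture.release()
print(f"{100 * dropped / max(expected_changes, 1):.1f}% of frames dropped")
```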

For our experiment, we tested two simple video-on-demand applications with a user interface similar to iPlayer. One was a version written with a DOM-based framework, using optimised CSS animations; the other was identical except that it was written using a WebGL framework. We tested both applications using a standard browser on a , which uses a chipset similar to a TV or set-top-box.

Screenshot of an iPlayer-style TV application under test, rendered using the DOM.

The application under test, rendered using the DOM.

In stark contrast to the claims we have seen about WebGL frameworks, our results show that the optimised DOM application performs significantly better on this platform. While the WebGL application drops 40% of frames, the DOM application drops less than a tenth of that. The recorded sequence shows that the WebGL application was dropping almost every other frame, which would be visually perceived as roughly half the frame rate, around 30 FPS. On the other hand, the DOM application reaches close to the target 60 FPS.

The reason that the results are so different to the claims made about WebGL is that methods using requestAnimationFrame() can significantly over- or under-report dropped frames, meaning that measurements can be far from the reality on screen. Conversely, the HDMI capture method captures the combined output of all stages of the render pipeline and gives an indication of arguably the most important metric: the presentation of animated visual updates to the viewer.

A graph showing the results of the experiment, showing a comparison of the two frameworks.

The results of the experiment, showing a comparison of the two frameworks.

Our results show that a DOM application can achieve better performance than its WebGL counterpart on some low-powered hardware, provided it is suitably optimised to use accelerated CSS properties. These properties are already widely implemented by browsers in TVs that meet current industry standards, meaning there is good support on current TV platforms. Continued work with standards groups can further improve and diversify support in the future.

We will continue to use our HDMI capture technique to measure the performance of other platforms and application frameworks, and hope to learn more. By expanding our knowledge on this subject, and through collaboration with our partners, we can make more informed decisions to improve support for optimised applications and continue to provide the best experience we can for our audiences.

]]>