Does ChatGPT have a place in content production?

Niki Lancaster

Head of Creative


Content Marketing

There has been a lot of talk about ChatGPT over the past few months, and one of the hot topics is using ChatGPT for copywriting. So, we decided to test it out to understand its strengths and limitations, and to work out if and how we could utilise ChatGPT to enhance the content production process.

Methodology

Our experiments took the following steps:

  1. Assemble a list of all possible content types and variables to test
  2. Create four different versions of each content type:
    • Version one: Created by a human
    • Version two: Created by ChatGPT at temp 0 (the temp settings we tested range from 0 to 100 and act as a sliding scale of predictability: outputs at 0 are considered ‘predictable’, whereas outputs at 100 are considered more ‘random’)
    • Version three: Created by ChatGPT at temp 50
    • Version four: Created by ChatGPT at temp 100
  3. Have humans review the examples created by ChatGPT, including blind testing:
    • Within our initial working group, we reviewed each piece and collected feedback on the quality
    • We then contacted a wider group outside the experiment to gather more feedback to assess whether any AI-generated content was of a high enough quality to go live on our clients’ websites
  4. Use AI tools to review the examples created by ChatGPT:
    • We ran the content through Grammarly, an automated copy checker we use as a standard part of our QA process
    • We ran the content through Originality to see if any of the content would have been classified as plagiarised or would be flagged as written by AI.

Selecting the content types to test

The first step was to form a working group and nail down exactly what we wanted to test. Simply testing ChatGPT on copywriting is too broad, as the briefs that come into the team vary widely. So, we collated a list of over 18 types of content we could be asked to produce, including company news blogs, press releases for campaigns, product page copy for websites, interviews, product buying guides, and category page content. We then factored in the differences between writing B2B and B2C copy and ensured our list contained variants of both where relevant.

Creating the content

Once we had the list of content types, we collated the human-generated content and then built prompts to instruct ChatGPT, based on the briefs we typically receive for each type of piece, so that the AI versions were produced from comparable instructions.

Once these were ready, we ran each prompt through ChatGPT three times: once at temp 0, once at temp 50, and once at temp 100, to see how the different temp settings affected the output.

This gave us four variations of each piece of content to test: one created by humans and three created by ChatGPT at the three different temps (0, 50, and 100).
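
As a rough illustration of this step, the sketch below shows how the three AI variants for a single brief could be generated programmatically. It assumes the OpenAI Python client rather than the ChatGPT web interface, uses a placeholder model name and brief, and maps our 0 to 100 temp scale onto the API’s 0 to 2 temperature range; none of these details reflect our exact setup.

```python
# Rough sketch: generate three ChatGPT variants of one brief at different temps.
# Assumptions: the `openai` Python package (v1.x), an API key in the
# OPENAI_API_KEY environment variable, and a placeholder model name and brief.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

brief = (
    "Write a 300-word product page for a ceramic mug shaped like a horse's head. "
    "Keep the tone friendly and avoid unverifiable claims."
)

TEMPS = [0, 50, 100]  # the 0-100 scale used in our experiment

variants = {}
for temp in TEMPS:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",         # placeholder model name
        temperature=(temp / 100) * 2,  # map 0-100 onto the API's 0-2 range
        messages=[{"role": "user", "content": brief}],
    )
    variants[temp] = response.choices[0].message.content

for temp, text in variants.items():
    print(f"--- temp {temp} ---\n{text[:200]}...\n")
```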

Testing

Humans vs AI test one

As a working group, we reviewed the initial outputs of ChatGPT and made notes on the quality of each output, covering:

  • Spelling and grammar
  • Structure
  • Readability
  • Whether the piece matched the client’s brand and tone of voice
  • The information included
  • How well the piece hit the brief
  • Whether what had been produced was factually accurate.

This gave us a starting point for deciding whether we would be happy to deliver this content to clients and wider teams.

Humans vs AI test two

Because we knew which content was written by humans and which by AI, and because we hold a bias (we’re part of the content team, so we will always advocate for human-written content), we then blind-tested the content with a wider group of people who weren’t part of the working group and didn’t know which content was AI-generated and which was human-written. These people were selected to review the content based on their knowledge of our clients and their involvement as stakeholders in our usual content sign-off process; they are essentially the gatekeepers for getting our content live or sent to clients.

As well as assessing the quality of the content, this group also looked at whether they’d use the content they’d been sent (either by putting it live or sending it over for sign-off from a client).
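
For anyone curious about how the blind element can be kept honest, below is a minimal sketch of one way to anonymise the four variants before sending them out for review. It illustrates the principle rather than the exact process we followed; the labels and placeholder texts are purely for demonstration.

```python
# Minimal sketch: anonymise the four variants of one content type so that
# reviewers cannot tell which version is human-written and which is AI.
import random

variants = {
    "human": "...",     # placeholder texts
    "temp 0": "...",
    "temp 50": "...",
    "temp 100": "...",
}

labels = ["A", "B", "C", "D"]
random.shuffle(labels)  # randomise which label each variant receives

answer_key = {}   # held back by the working group
review_pack = {}  # what the reviewers actually receive
for label, (source, text) in zip(labels, variants.items()):
    answer_key[label] = source
    review_pack[label] = text

print("Sent to reviewers:", sorted(review_pack))
print("Answer key (working group only):", answer_key)
```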

Humans vs AI results

Although the testing group was surprised by how quickly ChatGPT produced legible content, they found it easy to identify which pieces were AI-generated.

What people liked about ChatGPT and its outputs was the speed at which it delivered information. This made it useful for structuring content, doing light research for ideas and tips, and surfacing ideas that we may not otherwise have considered; for example, it suggested adding warranty information to a product page.

Despite this, ChatGPT’s flaws were obvious, and everyone in the testing group flagged the following:

  • It struggled to stick to a strict word count and would often go over
  • It was not the most accurate with spelling and grammar; it would often miss out commas, throw in random capital letters, and hyphenate words inconsistently
  • The content generated was usually very generic and did not add stories or anecdotes specific to a company or brand. For example, when we created a blog post about Search Laboratory being awarded our B Corp accreditation, it could not pull out the specific things we had done to achieve it that were unique to us as a business, so the copy was uninspiring and could have been written for any business with the same accreditation
  • It did not stick to stricter guidelines around the tone of voice and would use wording that humans knew immediately would not be considered ‘on brand’ for our clients
  • Similar to the above, ChatGPT did not understand legal issues around the wording in copy; for example, it added guarantees into the product copy that would not stack up from a legal perspective
  • As the temp of the content increased, the tool tended to add information that was not factually correct or go off on an irrelevant tangent. For example, when we created product page copy around a mug shaped like a horse’s head, we were given copy that went on to talk about horsehair
  • When ChatGPT was asked to give us interesting facts and stats about UK air travel, it delivered, but when it came to finding the sources of those facts, they were nowhere to be found. It seems the tool understands what a statistic looks like, but it hasn’t yet made the connection that statistics need to be true.

AI vs AI

Once we had all the results from the human tests, the next step was to use AI to assess whether it could work out if a piece of content was written by a human or by AI, and whether any of the content would be considered plagiarised.

To run these tests, we used Originality. Our research pointed to Originality as a leading tool for accuracy, and it allowed us to test for plagiarism and AI detection simultaneously. We ran all four examples of each content type through Originality to see how it scored the human-generated and AI-generated content.
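
Because Originality also offers an API alongside its web interface, checks like this can be batched. The sketch below is only a rough outline of that idea: the endpoint path, auth header, and response fields are assumptions rather than confirmed details, so check Originality.ai’s current API documentation before relying on any of them.

```python
# Rough sketch: batch-check content with Originality's API.
# The endpoint path, header name, and response shape below are assumptions,
# not confirmed details; consult Originality.ai's API docs before use.
import requests

API_KEY = "YOUR_ORIGINALITY_API_KEY"                    # placeholder
ENDPOINT = "https://api.originality.ai/api/v1/scan/ai"  # assumed path

def scan_content(text: str) -> dict:
    """Send one piece of content for an AI-detection / plagiarism scan."""
    response = requests.post(
        ENDPOINT,
        headers={"X-OAI-API-KEY": API_KEY},  # assumed header name
        json={"content": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # assumed to contain AI vs original scores

# Example: scan all four variants of one content type.
pieces = {"human": "...", "temp 0": "...", "temp 50": "...", "temp 100": "..."}
for label, text in pieces.items():
    print(label, scan_content(text))
```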

 

A bar chart showing the scores given by Originality.ai for content created by ChatGPT at each temp setting.

AI vs AI results

All the AI-generated and human-generated content scored zero for plagiarism, meaning that none of the information was considered plagiarised.

On average, the AI content scored 91.9% AI-generated within Originality across the three temps, whereas the human-generated content scored an average of 91.75% human-original, proving that Originality could quickly and accurately detect which content was AI-generated and which was written by humans.

As the temp number increased, the likelihood of Originality detecting that the content was AI-generated decreased very slightly. However, it was still obvious to Originality that temp 100 content was AI-generated, as it scored an average of 84.9% AI-generated.
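
To make that averaging concrete, here is a small sketch of how per-piece detection scores can be rolled up per temp setting. The numbers are placeholder values for illustration only, not our actual results.

```python
# Illustrative roll-up of Originality AI-detection scores per temp setting.
# The scores below are placeholder values, not our actual results.
from statistics import mean

scores = {
    0:   [0.97, 0.94, 0.95],   # AI-likelihood per piece at temp 0
    50:  [0.93, 0.92, 0.96],
    100: [0.86, 0.84, 0.85],
}

for temp, values in scores.items():
    print(f"temp {temp}: average AI-generated score {mean(values):.1%}")

overall = mean(v for values in scores.values() for v in values)
print(f"overall average across the three temps: {overall:.1%}")
```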

Results summary

Both humans and AI could recognise when content was AI-generated. As the temp of the ChatGPT content increased, detecting that the content was AI became slightly harder for Originality. However, at higher temps the quality reduced, making it easier for humans to detect.

ChatGPT proved good for research, helping with structure and writer’s block. This led to time efficiencies with content production.

The main flaws with ChatGPT were around quality, how factually correct the information was, and its lack of understanding of a company, its brand, and its history.

What are our recommendations?

Based on its flaws, ChatGPT is not replacing human copywriters right now, and we would never recommend publishing any content from the tool without human input.

There are some content briefs, such as written interviews, that ChatGPT just would not be relevant for. For everything else, ChatGPT proved useful for research, ideation, and structure, and it cut down the time spent on things like Digital PR tips pieces. That said, any research done with ChatGPT must be taken with a large pinch of salt, as it cannot yet be relied on for facts and statistics.

We will continue to test AI for use in content creation, and for wider use across the agency. The pace of development of this technology is rapid, so we expect improvements will come very soon.