Does ChatGPT have a place in content production?

Niki Lancaster

Head of Creative


Content Marketing

There has been a lot of talk about ChatGPT over the past few months, and one of the hottest topics is using ChatGPT for copywriting. So, we decided to test it out to understand its strengths and limitations, and whether and how we could use it to enhance the content production process.

 

Methodology

 

Our experiments took the following steps:

 

  1. Assemble a list of all possible content types and variables to test
  2. Create four different versions of each content type:
    • Version one: Created by a human
    • Version two: Created by ChatGPT at temp 0 (ChatGPT temps range from 0 to 100; the tool describes this as a sliding scale of predictability, so outputs at 0 are considered ‘predictable’ whereas outputs at 100 are considered more ‘random’)
    • Version three: Created by ChatGPT at temp 50
    • Version four: Created by ChatGPT at temp 100
  3. Conduct a human review of the examples created by ChatGPT, including blind testing:
    • Within our initial working group, we reviewed each piece and collected feedback on the quality
    • We then contacted a wider group outside the experiment to gather more feedback to assess whether any AI-generated content was of a high enough quality to go live on our clients’ websites
  4. Use AI tools to review the examples created by ChatGPT:
    • We ran the content through Grammarly, an automated copy checker we use as a standard part of our QA process
    • We ran the content through Originality to see if any of the content would have been classified as plagiarised or would be flagged as written by AI.


Selecting the content types to test

 

The first step was to form a working group and nail down exactly what we wanted to test. Simply testing ChatGPT on all copywriting is too broad, as the briefs we work on can be quite niche. So, we collated a list of more than 18 types of content that we are asked to produce, including company news blogs, press releases for campaigns, product page copy for websites, interviews, product buying guides, and category page content. We then factored in the different nuances of writing B2B and B2C copy and, where relevant, ensured our list contained variants of both.

 

Creating the content

 

Once we had the list of content types, we collated the human-generated content and then built prompts for ChatGPT, based on the briefs we typically receive, to produce an AI version of each piece.

Once these were ready, we ran each prompt through ChatGPT three times: once at temp 0, once at temp 50, and finally at temp 100, to understand how the outputs differed across the temp settings.

This gave us four variations of the content to test: one created by humans and three created by ChatGPT at the three different temps (0, 50, and 100).
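For readers who want to replicate this step programmatically, here is a minimal sketch of how the three ChatGPT variants could be generated via the OpenAI API. Note that the API expresses temperature on a 0 to 2 scale rather than 0 to 100, so the mapping of our three settings, along with the model name and brief text, are illustrative assumptions rather than our exact setup.

```python
# A minimal sketch of generating one ChatGPT draft per temperature setting.
# The model name, the brief, and the mapping of our 0/50/100 "temps" onto the
# API's 0-2 temperature scale are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Our three experiment settings, mapped onto the API's temperature range.
TEMPS = {0: 0.0, 50: 1.0, 100: 2.0}

def generate_variants(brief: str) -> dict[int, str]:
    """Return one ChatGPT draft of the brief for each temperature setting."""
    variants = {}
    for label, temperature in TEMPS.items():
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model name
            temperature=temperature,
            messages=[{"role": "user", "content": brief}],
        )
        variants[label] = response.choices[0].message.content
    return variants

# Example usage with a hypothetical brief:
# drafts = generate_variants("Write 300 words of product page copy for ...")
```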

Testing

 

Humans vs AI: Test #1

 

As a working group, we reviewed the initial outputs of ChatGPT and made notes on their quality, covering:

 

  • Spelling and grammar
  • Structure
  • Readability
  • Whether the piece matched the client’s brand and tone of voice
  • The information included
  • How well the piece hit the brief
  • Factual accuracy

This gave us a starting point for whether we would be happy to deliver this content to clients and wider teams.

 

Humans vs AI: Test #2

 

Because we knew which content was written by humans and which was written by AI, and because we hold a bias (we’re part of the content team, so will always advocate for human-written content), we then blind-tested the content with a wider group of people who weren’t part of the working group and didn’t know which content was AI-generated and which was written by humans. These people were selected to review the content based on their knowledge of our clients and their involvement as stakeholders in our usual content sign-off process; they are essentially the gatekeepers for getting our content live or sent to clients.

As well as assessing the quality of the content, this group also looked at whether they’d use the content they’d been sent (either by putting it live or sending it over for sign-off from a client).

Humans vs AI: Results

 

Although the testing group was surprised by how quickly and legibly the AI-generated content was written, they found it easy to identify as AI-generated.

What people liked about ChatGPT was the speed at which it delivered information. This made it useful for structuring content, doing light research for tips, and surfacing ideas that may not previously have been considered. For example, it suggested adding warranty information to a product page.

Despite this, the flaws with ChatGPT were obvious, and everyone in the testing group flagged the following:

 

  • It struggled to stick to a strict word count and would often go over
  • It was not the most accurate with spelling and grammar; it would often miss out commas, throw in random capital letters, and hyphenate words inconsistently
  • The content generated was usually very generic and did not add stories or anecdotes specific to a company or brand. For example, when we created a blog post about Search Laboratory being awarded our B Corp accreditation, it could not pull out the specific things we had done to achieve this that were unique to us as a business. So, the copy was uninspiring and could have been written for any business that achieved this accreditation
  • It did not stick to stricter guidelines around the tone of voice and would use wording that humans knew immediately would not be considered ‘on brand’ for our clients
  • Similar to the above, ChatGPT did not understand legal issues around the wording in copy. For example, it added guarantees to product copy that would not stack up from a legal perspective
  • As the temperature of the content increased, the tool tended to add information that was not factually correct or go off on an irrelevant tangent. For example, when we created product page copy around a mug shaped like a horse’s head, we were given copy that went on to talk about horsehair
  • When ChatGPT was asked to give us interesting facts and stats about UK air travel, it delivered, but when it came to finding out the source of the facts, they were nowhere to be found. It looks like the tool understands what a statistic is, but it hasn’t yet made the link to objective truth.

AI vs AI

 

Once we had all the results from our human tests, our next test was to use AI to assess whether a piece of content was written by a human or by AI. We were also curious whether any of the content would be red-flagged as plagiarised.

To run these tests, we used Originality. Originality is a leading tool for detecting plagiarism and AI-generated content, and it allowed us to test for both areas simultaneously. We tested all four examples of each type of content with Originality to understand its outcomes for human-generated and AI-generated content.
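For anyone wanting to automate this step, below is a rough sketch of how a piece of content could be submitted to a detection service over HTTP. Originality does offer an API, but the endpoint URL, header name, and response fields shown here are illustrative placeholders; the real interface should be checked against Originality’s own documentation.

```python
# A rough sketch of sending one piece of content for an AI/plagiarism scan.
# The endpoint URL, header name, and API key variable are placeholders, not
# a confirmed description of Originality's actual API.
import os
import requests

ORIGINALITY_URL = "https://api.originality.ai/api/v1/scan/ai"  # placeholder endpoint

def score_content(text: str) -> dict:
    """Submit one piece of content for scanning and return the parsed JSON response."""
    response = requests.post(
        ORIGINALITY_URL,
        headers={"X-OAI-API-KEY": os.environ["ORIGINALITY_API_KEY"]},  # placeholder header/key
        json={"content": text},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

# Example usage with a hypothetical draft:
# result = score_content(human_draft)
# print(result)  # inspect the AI / original scores returned by the tool
```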

 

Bar chart: Originality AI scores for the human-written and ChatGPT-generated content in our test.

 

AI vs AI: Results

 

All the AI-generated and human-generated content scored zero for plagiarism, meaning that none of the information was considered plagiarized.

On average, across the three temps, the AI content scored 91.9% as AI-generated within Originality. The human-generated content scored an average of 91.75% as human-original, proving that Originality could quickly and accurately detect which content was AI-generated.

As the temperature number increased, the likelihood that Originality could detect AI-generated content decreased very slightly. Even considering that, it was still obvious to Originality that temp 100 content was AI-generated, as it scored an average of 84.9% AI-generated.
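As a quick illustration of how per-temp averages like these could be calculated from individual scan results, here is a minimal sketch; the record format is an assumption about how the scores might be stored rather than our actual data.

```python
# A minimal sketch of aggregating individual AI-detection scores into
# per-temperature averages. The (temp, ai_score) record format is assumed.
from collections import defaultdict
from statistics import mean

def average_ai_score_by_temp(results: list[tuple[int, float]]) -> dict[int, float]:
    """Group (temp, ai_score) pairs by temp and return the mean score for each temp."""
    by_temp: dict[int, list[float]] = defaultdict(list)
    for temp, ai_score in results:
        by_temp[temp].append(ai_score)
    return {temp: round(mean(scores), 2) for temp, scores in by_temp.items()}

# Example usage with hypothetical scan results:
# averages = average_ai_score_by_temp(scan_results)
# print(averages)
```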

Results summary

 

Both humans and AI could recognize when content was AI-generated. As the temp of the ChatGPT content increased, detecting that the content was AI-generated became slightly harder for Originality, the online detection tool we used. However, at higher temps the quality was reduced, making it easier for humans to detect.

ChatGPT is good for research, helping with structure, and overcoming writer’s block, which improves the time efficiency of content production.

The main flaws with ChatGPT are quality control, factual accuracy, and its lack of understanding of a company’s branding and history.

What are our recommendations?

 

Based on its flaws, ChatGPT is not replacing human copywriters any time soon, and we would never recommend publishing any content from the tool without human input.

There are some content briefs that ChatGPT just would not be relevant for, such as written interviews. For everything else, ChatGPT proved useful for research, ideation, and structure, and it cut down the time needed for things like Digital PR tips pieces. That said, any research done with ChatGPT must be taken with a large pinch of salt, as it cannot yet be relied on for facts and statistics.

We will continue to test AI for use in content creation, and for wider use across the agency. The pace of development of this technology is rapid, so we expect improvements will come very soon.

 

 
