Skiplagging, CAPTCHA vs Bots, Scraper Bots

Season 2, Episode 5
14th September 2023

This month’s episode takes off with a journey into the controversial world of skiplagging, also known as hidden city flying. Airlines and holiday businesses are taking legal action against passengers and websites like Skiplagged that exploit pricing loopholes, leaving empty seats on the second leg of multi-stop itineraries. But with scraper bots at the root of the issue, is there a technical solution to limit the practice?

On the topic of bots, a recent report from the University of California, Irvine, revealed that bots are now faster and more accurate than humans at solving CAPTCHA challenges. In this episode we discuss whether there is still a place for CAPTCHA in detecting bot traffic, and try to decipher Elon Musk’s comments about the report – Does it spell the end of bot detection, and is his X subscription model the only answer…?

To conclude, we go more in depth on scraper bots – not only do they facilitate skiplagging, but there are endless uses for scrapers, both well meaning and malicious. How concerned should businesses be about scraper bots, and does their presence often indicate more sinister attacks on the horizon?


Danielle Middleton-Wren

Head of Media, Netacea
Matthew Gracey-McMinn

Head of Threat Research, Netacea
Chris Collier

Head of Solution Engineering, Netacea
Gary Clarke

Solutions Engineer, Netacea

Episode Transcript

[00:00:00] Dani Middleton-Wren: Hello and welcome to Cyber Security Sessions.

I am Dani Middleton-Wren, Head of Media at Netacea. Today I am joined by Matthew Gracey-McMinn, Head of Threat Research at Netacea, Chris Collier, Head of Solutions Engineering, and Gaz Clarke, Solutions Engineer. And we are going to be discussing skiplagging, and AI versus CAPTCHA, as well as our bot attack of the month, scraper bots.

So topic number one of today's podcast, we are going to cover skiplagging or hidden city ticketing. This is where instead of booking a direct flight to your intended destination, it's cheaper to book a flight elsewhere that has a connection at your intended destination. You then just don't board the next flight on your itinerary.

And airlines are battling to stop this practice because it is losing them money. American Airlines have taken the website Skiplagged to court and are reportedly banning passengers who use their services and cancelling their frequent flyer miles. Meanwhile, Lufthansa has unsuccessfully sued a passenger who skiplagged in 2019.

So I'm absolutely fascinated by this practice, so to start with, let's dig into what skiplagging actually is and how it works and why people are able to hack the system using this technique.

Matt, do you want to start us off with a quick overview of what skiplagging is?

[00:01:26] Matthew Gracey-McMinn: Yep, quite happy to do so. So, skiplagging essentially is where you book a connecting flight through to a final destination. So, let's say I wanted to go on holiday in New York. I've got a nice hotel booked in New York. And I look at flights to New York and say they're about £800. Quite expensive, quite steep. But I see there's a flight to, say, San Francisco that connects through New York that's about £500. But potentially, let's say I don't want to take my suitcase so it won't transfer, I could just fly to New York, get off the plane, and not get on my connecting flight. And then I've essentially had a flight to New York for about £500 instead of £800. As a consequence, my seat on the second flight will go unfilled.

No one will be on it or anything like that, which is quite frustrating for the airlines, as you might imagine. But essentially, it gives me a better price, and all I'm doing is losing the ability to take my suitcase. It's a rather simple idea that takes advantage of sort of market dynamics and that sort of thing to try and... essentially get a cheaper price and take advantage of the physical connection, where you're physically in the city you want to get to without needing to get onto a second one.
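The fare logic Matt describes can be sketched in a few lines. The following is a minimal illustration, not how any real fare search works: the `Fare` type, the `hidden_city_options` function and the airport codes (LHR, JFK, SFO, ORD) are all hypothetical, and the prices are simply the £800 and £500 figures from his example plus one invented fare that does not pass through New York.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fare:
    origin: str
    stops: Tuple[str, ...]  # airports after the origin, in order; the last one is the ticketed destination
    price_gbp: float


def hidden_city_options(fares: List[Fare], origin: str, wanted: str) -> List[Fare]:
    """Return fares that use `wanted` as a layover and undercut the cheapest direct fare."""
    direct = [f.price_gbp for f in fares if f.origin == origin and f.stops == (wanted,)]
    if not direct:
        return []
    cheapest_direct = min(direct)
    candidates = [
        f for f in fares
        if f.origin == origin
        and wanted in f.stops[:-1]          # `wanted` is a layover, not the final stop
        and f.price_gbp < cheapest_direct   # only interesting if it beats the direct fare
    ]
    return sorted(candidates, key=lambda f: f.price_gbp)


# Hypothetical fares mirroring the example in the conversation.
fares = [
    Fare("LHR", ("JFK",), 800.0),        # direct to New York
    Fare("LHR", ("JFK", "SFO"), 500.0),  # San Francisco via New York
    Fare("LHR", ("ORD", "SFO"), 450.0),  # cheaper, but never touches New York
]

options = hidden_city_options(fares, "LHR", "JFK")
saving = 800.0 - options[0].price_gbp  # -> 300.0
```

The only itinerary returned is the one routed through New York, which is exactly the ticket a skiplagger would buy and then abandon at the layover.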

[00:02:30] Dani Middleton-Wren: And so I suppose it's an ethical question, would you say, rather than a legal question at this point? From the reading I've been doing into the Lufthansa case, they stated in their terms and conditions that, well, it goes against their policy, but that doesn't necessarily mean it's illegal.

It just means that Lufthansa, if they want to take somebody to court about it and sue, as they have done, then they are within their rights. But I suppose for the passenger, it's an ethical question whether they're happy to leave that seat unfilled on their connecting flights, but it's not illegal.

They are just playing the system. And if it comes down to, say, if you're a frequent flyer and you're thinking, Oh, great, I can save myself £500 to £1000 here, then I'm going to do it. Because over the course of a year, you could save yourself an awful lot of money.

[00:03:16] Matthew Gracey-McMinn: Yeah, there is an ethical element of it. Personally, I feel a bit not too comfortable with the idea. I mean, I appreciate people want to save money, cost of living crisis and all that. Personally, though, it feels a little bit deceptive. You know, when you're booking an airline ticket, you're actually entering a contract with someone.

So it feels to me kind of bad faith. And like you say, it's a breach of terms of service. So you are putting yourself at risk there.

It's also the climate question. You know, that second flight, like you say, is going to run. The plane's going to fly because they think there's going to be people on it.

If, say, two thirds of the people on the plane on that... layover flight decide not to get on it because they just want to stop at the layover, that plane could be running at, say, one third of the people, or possibly in extreme situations with no one on it, which is not great for the environment, let's be honest.

[00:03:58] Dani Middleton-Wren: Thanks, Matt. So let's expand on that business impact a little, Chris. So Matt touched on a few points there, including the financial impact, the competition in the market and what is driving passengers to skiplag. What is the financial impact going to be on airlines, and what do you think their perspective is on skiplagging and why they are now really cracking down on it? Why are we seeing the likes of Lufthansa and United Airlines and American Airlines start to react so much to this?

[00:04:28] Chris Collier: Well, I think the airline pricing logic is an incredibly complicated thing for you to have to do. I think like whenever you think about any sort of competitions and markets, it's incredibly hard for you and your competitors to be able to continually keep outpricing each other, particularly on particular routes and things like that.

I think why you're starting to see this emergence of skiplagging is because you have two different types of travelers, really. When you think about it, you have people that travel for business, you have people that travel for leisure, and you're going to find that the more leisure orientated routes are going to probably be cheaper because they're going to want people to fill those seats and get people on those planes and things like that.

Whereas routes that are maybe more business orientated, so like Boston to Houston as an example, that would probably be, for me, considerably more business orientated, right? Boston, massive tech hub and stuff like that in the US. Same with Houston in Texas, as an example, you'd expect there to be a lot of business people on those flights.

But let's say for argument's sake, it was a flight from Houston to Las Vegas, but for some bizarre reason you did some sort of weird layover in Boston as an example, and that's the intended area, and that's how you've then got that cheaper. Because it may be that the operator that you're working with is actually having to fight competition in order for them to actually price that particular flight from Houston to Las Vegas with a layover in Boston as an example cheaper than the vast majority of the other competitors, versus in that particular market, i.e. just Houston to Boston as an example, and the operators that actually run that route. Because you've also got to take into account, not all operators will fly every route either as well, right?

And so I've barely even scratched the surface and you can see how complicated it can get just from a pricing perspective alone and the routes that you're actually operating on. Why do people do it? Well, let's be fair, consumers want to get the best thing that they can get for the least amount of money that they can spend, like... we are in a lot of areas of the world right now in some sort of economic crisis. Inflation is quite high and stuff like that.

So yeah, you can understand why consumers possibly want to do that. I mean, it's not something that I've ever done and like you I wasn't really aware that people were doing this until fairly recently. And I find it kind of fascinating that you can actually get hold of tickets for longer haul flights as an example in quite a lot of cases, albeit split over two journeys, for cheaper than going somewhere that's maybe a much shorter distance as an example there, which I find interesting.

[00:06:51] Dani Middleton-Wren: Really, it's interesting because it's kind of what train lines have started to do, but they've legitimized it. You can get a train fare and do a split save. They offer that and say "we will facilitate you getting a better rate for your journey".

And you can save yourself a lot of money. So do we think this is a direction that the airlines will take?

[00:07:10] Chris Collier: It's an interesting one, isn't it, when you think about it, because you can understand why airlines like Lufthansa and American Airlines are taking people to task about it. Because it's not just the financial impact of potentially losing money because you're not filling both journeys as an example, it may be that the connecting flight was somebody else's and they've got to pay half of that fare over to the other airline. But what I think a lot of people don't think about is the knock-on effect of you actually missing that connecting flight, right?

So you're gonna have potential delays for other people that need to get on those planes while they're waiting and calling your name and you're just never gonna turn up for that flight. That can lead to fines for those airlines as well, purely and simply because they have turnaround times. So like when planes land in airports they have to have them cleaned, people off, baggage off, reloaded and people back on them and back out again within certain periods of time and stuff like that.

And they have to push away from the gate and be in the air within certain periods of time and they have very tight deadlines. Like a lot of people don't realize how busy the air actually is and how many planes are all over the place at any one given time, and when you're not going for that flight, you've got the potential delays of that, but then also the running costs of flying a flight that's at less capacity, and you're still having to burn all of that fuel that you've had to basically fill the plane with for no reason.

[00:08:30] Dani Middleton-Wren: Yeah, and to carry a certain number of people with a certain weight distribution. Yeah, there are so many things that you and Matt have both pointed out that are, yeah, absolutely things that people probably won't consider if they're using this system. Just thinking purely on an economical scale about what they can personally save.

Gary, let's have a think about some of the technical issues at play here. So, aside from taking Skiplagged and other services to court or banning passengers, is there another technical solution for airlines to prevent skiplagging?

[00:09:01] Gary Clarke: Yes, well, I suppose it depends on the source of the information, where it's coming from, because anyone can go online, read about skiplagging. Anyone who knows what skiplagging is, then they can just go to the airline and try to exploit the system and look for cheaper flights that way.

So in terms of technically putting a solution against that, it's going to be very difficult. There are things they could do, for example, like on the website, when you're going to check out, they could issue you with a warning stating that if you're not going to be completing all the legs of your journey, then you're going to be issued with a fine or something along those lines. Even at the airport, I think Matt stated before that you could get on a flight if you're not checking in any baggage, it's fine because you're not going to lose anything, but they could force you to check in any hand luggage there. So if you get off before the end of the journey, then you're going to lose that hand luggage.

So in terms of that, there's not really much they can do, but when you look at sites such as, they get the data by actively scraping these airlines. They look for travel data and price data. And I think the only way to combat that is if you get some kind of scraper blocking strategy. And to do that, you're going to have to implement some kind of bot management solution that specializes in web scraper detection and mitigation.

And what that'll do is it'll block all requests from certain websites, stop that data being pulled and it'll stop it being accessible in real time.
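As a rough sketch of one signal a scraper-blocking strategy might use, the snippet below flags clients that exceed a request budget within a sliding time window. This is an assumption for illustration only: the `RateFlagger` class, its thresholds and the client identifier are hypothetical, and real bot management products combine many more signals than request rate.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Sliding-window request counter (hypothetical, for illustration only)."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id: str, now: float) -> bool:
        hits = self._hits[client_id]
        # Discard timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over budget: treat as scraper-like and refuse
        hits.append(now)
        return True


# A client requesting a fare page once a second against a budget of 5 per 10 seconds.
flagger = RateFlagger(max_requests=5, window_seconds=10.0)
decisions = [flagger.allow("client-a", float(t)) for t in range(7)]
later = flagger.allow("client-a", 20.0)  # the window has cleared by now
```

The first five requests pass, the next two are refused, and the client is allowed again once the window has emptied. A scraper that slows down or spreads its requests across many addresses would slip under a check like this, which is why rate alone is rarely sufficient.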

[00:10:23] Dani Middleton-Wren: Gary, you made some points there about what on the technical side, the airlines can be doing to prevent scraping activity and stop that ability to collate all of that data in an automated fashion and put it on an aggregator site like Skiplagged and make that option available to people, but what else can businesses be doing to influence customer and passenger behavior that is perhaps drilling down to that ethical stance and reinforcing it?

[00:10:52] Chris Collier: I think that's one of the things that's really interesting about what they've done with the court case from a Lufthansa point of view. And I think they're actually appealing that, aren't they? Because I think the courts in Germany actually found in the passenger's favor.

Because it was to do with pricing and the fact that they wanted the funds back for basically the flights that he'd not basically paid for. I think there's maybe the moral side of it. And I think if they'd gone to court with a more moral standpoint on it as well... I think it'd be interesting to see organizations and businesses like Lufthansa, American Airlines and other airlines think about the actual environmental impact of what this is actually doing and have a more moral view on it, rather than one of the financial impact that they're actually getting. And I think that maybe if they were to stand up and be a bit more like, "guys, we get that obviously you want to fly cheaper and things like that. And we are trying to work on how to be able to do that. But what you're doing is essentially just as bad as you throwing plastic in the ocean."

[00:11:45] Dani Middleton-Wren: Yeah, I think it is a difficult one, and that's why I say the subject is so fascinating because in a time when we are all struggling for cash, that is why Skiplagged is thriving.

[00:11:54] Matthew Gracey-McMinn: Yeah, I wonder if they were perhaps more transparent around why there is this pricing differential, because from a consumer perspective, it does seem a bit counterintuitive that if I'm flying to one place and then to another place, that's cheaper than flying to just the first place.

That is very counterintuitive to just how we look at pricing and how it should work. So consumers do tend to feel like, "am I being taken advantage of here?" And then suddenly it's like, "if I'm being taken advantage of, I don't mind taking advantage of you in return if I've got that option."

So perhaps with a bit more transparency around where those prices come from, you might find customers and probably even the courts are a bit more sympathetic towards the airlines as well.

[00:12:33] Dani Middleton-Wren: Absolutely. So it seems like what the airlines can be doing is saying, this is why. Like you're saying, Matt, provide that justification for the pricing. It might be because, as Chris has already observed, you've got the competition in a high traffic business route, and that's why the price is what it is, or it might be competition in the market that allows them to justify. It could be the increase in price of fuel, which we've seen a lot over the last few years, especially with geopolitical activity. There are lots of different reasons why prices have escalated. I think that transparency is key from the airlines, making sure that customers understand the impact of their own decisions when it comes to purchasing flights and not showing up for flights, and what that cumulative effect is. It is going to have a knock-on effect that many of us wouldn't even consider.

You think "well, I'm one person, the flight's going to be going anyway." But if we all think like that, we're in big trouble. And then Gaz, to your point about how platforms like Skiplagged can be stopped and what airlines can do, we need to think about what technical approaches we can take to enable airlines to prevent that activity. And that is for the most part going to be detecting and preventing scraper bots and the like, which we will explain in greater depth later, when we go on to our attack of the month.

[00:14:04] Dani Middleton-Wren: Okay. So let's move on to topic number two, AI versus CAPTCHA. So we're all familiar with CAPTCHA, the puzzles we're sometimes shown by web services to prove that we are human and not those pesky bots. These have evolved from simply typing in some numbers and letters to clicking on traffic lights or bridges within grids of squares to even more elaborate and difficult tests like non-existent AI-generated images.

However, a recent study from the University of California concluded that bots are now faster and more accurate at solving CAPTCHA puzzles than humans. And it's by some degree. If we look at those figures, it's something like between 51 percent and 84 percent of humans will accurately solve a CAPTCHA test, whereas the bots, it ranges from, depending on the type of CAPTCHA test, it can range between 97 and a hundred percent.

That is a much higher successful completion rate. So then you've got Elon Musk weighing in and he has concluded the only way to stop bots at scale on his X app was via, and we may all have predicted that this would be his response, via his premium subscription model.

[00:15:12] Chris Collier: Makes you laugh, doesn't it?

[00:15:13] Dani Middleton-Wren: He does regularly.

So let's have a think about that University of California research. It's a really interesting paper. I encourage anybody to have a read of it. It's extremely thorough and it really explains why sophisticated bots particularly are able to do this. And it highlights how bots have evolved over time.

CAPTCHAs were put in place a long time ago, and it raises the question of, is CAPTCHA alone a good enough way to tackle bots, or does there need to be something more in place? So let's start with Matt, does this research align with what you've seen regarding CAPTCHA bypasses?

Cause I know you've done a lot of research in this area.

[00:15:53] Matthew Gracey-McMinn: Yeah, 100%. Absolutely 100 percent aligns with what we've seen. In fact, our own research into it really highlighted one key point, which is that... so what CAPTCHA was originally used for was to distinguish between humans and bots on websites. And then people thought, what a great way to train AI as well, you know.

Google, for instance, used those old text-based CAPTCHAs to train their Google Books project so they could recognize the words in the books. And we've used it since to train image-based AI. So basically, every time you're filling in the CAPTCHA, you're actually feeding information to an AI model telling it, you know, this is what a bridge looks like, this is what traffic lights look like, this is what bicycles look like.

Information was put into AI models. So as a civilization, everyone on the internet has been collectively working together to train AIs to be really, really good at doing the thing that gets past CAPTCHA. And then we're kind of acting surprised now that computers are really good at getting past CAPTCHA.

The combined human effort involved in training these is probably quite substantial. I would suspect probably more than went into training ChatGPT.

And we were surprised by that. So, it's not really a huge surprise to discover that, you know, bots are now better than humans at getting through CAPTCHA. In fact, I almost suspect that we might get to the point where we provide CAPTCHAs and if you fail it, we let you through. If you pass it, we block you.

That would always be more effective at this point. Obviously we deal with bots at Netacea, so we see automated attacks over the internet all the time. And attackers have for years now used CAPTCHA bypass services - and there's a whole sort of subsidiary ecosystem of CAPTCHA bypass services - that provide API endpoints. So bots that hit a CAPTCHA can make a request out to this service, and it's usually charged at about one to three dollars per thousand CAPTCHA solves. They make a request out to the service and it will solve the CAPTCHA for them and let the bot through. It's very cheap, very effective, and most of those services advertise a success rate against any sort of CAPTCHA, all the different providers of CAPTCHA and everything.

They advertise success rates of well over 95%, often a 99.999 percent success rate. And our testing of these services showed that they're not really lying about that, unfortunately.

[00:18:06] Dani Middleton-Wren: I really like your point there, Matt, that the failure rate is probably a better way to go. If that's the way to prove you're human, fail that test.

[00:18:16] Matthew Gracey-McMinn: I would like them to go that way 'cause I have in my team become almost famous for my ability to fail CAPTCHA repeatedly. So it would be nice if we could switch that way so I could get through sometimes.

[00:18:27] Dani Middleton-Wren: Yeah, and I think the report also quoted some demographic statistics as well about the time it takes for people of certain ages to pass the test. And I think you're just never going to get that with bots, are you? Bots don't have an age. They're just going to do it instantly. Whereas with a human, maybe if you incorporate either, yeah, failure as success, great, or if you start to measure the time it takes for somebody to pass a test rather than it simply being somebody's ability to pass a test, then perhaps we'd have something to go on to justify that it's a human being rather than a bot.

[00:19:00] Matthew Gracey-McMinn: That does actually raise an interesting point on the history of CAPTCHA bypass though, because originally before AI got good enough for people to use that to bypass, the way past it when humans were better was actually to outsource it to a human. So original CAPTCHA bypass services would just send them to, I think sweatshops almost, of people solving CAPTCHAs.

And on average, these people would solve CAPTCHAs probably once every 18 seconds or so. But they'd just be solving loads and loads of CAPTCHAs. And that actually raised a lot of ethical concerns, because that was, the amount they paid, it was pretty much just slave labor at that point. Conversely though, when you look at the sort of AI models, at least that's sort of replacing that human slavery. So that there are some good elements to this development as well 'cause what we've seen with a lot of the CAPTCHA bypass services, they're moving to the AI models. It's better than having humans solve it.

And, people do look for rapid passing of CAPTCHA and so forth, as examples. And instead, AI models know that that gets them blocked. So they delay it and they try to act more, I was about to say human like, but it's not actually human. The bots are now more human than the humans, apparently.

So we're in this really weird situation of if you are identified as a bot, you're probably a human at this point.

[00:20:10] Dani Middleton-Wren: Yeah, and like you said, we've trained them, we've given them all this information about ourselves. We've passed it on. But those sweatshops that you referred to, they're often called click farms, or were called click farms, and they were a much bigger part of the CAPTCHA bypass system. So Chris, you, I believe, experienced working in a CAPTCHA click farm for a day.

[00:20:31] Chris Collier: I certainly did, yeah. Many, many moons ago, when we were very first starting to look into CAPTCHA farms and things like that, I think, Tom and I decided that we were going to go and spend an hour just having a go and seeing how much, how many CAPTCHAs we could solve and how much money we could make.

And Matt is absolutely correct. I think in half an hour, we made about 10p. And we were solving CAPTCHAs literally one after the other for half an hour solid. And I think we made about 10p. And that was what was really interesting about the evolution of it. Just looking at it from that point of view. Like when we very first started to say, huh, are they actually farming these CAPTCHAs out to people to actually solve these as ways of getting past them.

We then found those services, thought, how easy is it actually to employ yourself? Could you actually make a living wage on doing this? Right? Because let's have a look at it. Obviously not. No, but what was interesting though is they had lots of sites that had lots of profiles of people that were working there and saying, oh, how, how great their life was because they were working for them and solving the CAPTCHAs and stuff like that.

When honestly, I don't know how they were making any money at all, which is kind of interesting.

[00:21:40] Matthew Gracey-McMinn: In our recent infiltration of one of those CAPTCHA farms, we saw that some of them talk quite extensively about the sort of ethical angle and human slavery angle. They actually say, you know, we're paying $100 a month to some of these employees, which in parts of the world that they're employed in is really good. And it's better that they're in front of a computer than a toxic factory or something like that.

So that they're much better off. But when we actually joined those CAPTCHA projects and started working on them, what we found was that to hit that $100 a month you'd have to be solving a CAPTCHA, on average, every 18 seconds. So basically, you'd have to be solving CAPTCHAs for 16 hours a day, 7 days a week.

You would still have to solve a CAPTCHA every 18 seconds. So no breaks for 16 hours a day, 7 days a week. No toilet, no food, no sleep, nothing. You're just working flat out to reach that $100. So we don't think anyone actually is. And the ethics of it, like you're saying, Chris, it's a pretty awful gig.
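The arithmetic behind that claim can be checked directly. The sketch below assumes a 30-day month (the conversation does not specify one) and uses the figures quoted above: one solve every 18 seconds, 16 hours a day, for $100 a month.

```python
SECONDS_PER_SOLVE = 18
HOURS_PER_DAY = 16
DAYS_PER_MONTH = 30        # assumption: the speakers do not say how long a month is
MONTHLY_PAY_USD = 100.0

# Solves completed in one 16-hour day with no breaks at all.
solves_per_day = HOURS_PER_DAY * 3600 // SECONDS_PER_SOLVE

# Solves completed in a month of working every single day.
solves_per_month = solves_per_day * DAYS_PER_MONTH

# What the worker earns per thousand solves, and per hour.
pay_per_thousand_usd = MONTHLY_PAY_USD / solves_per_month * 1000
pay_per_hour_usd = MONTHLY_PAY_USD / (HOURS_PER_DAY * DAYS_PER_MONTH)

print(solves_per_day, solves_per_month)
print(round(pay_per_thousand_usd, 2), round(pay_per_hour_usd, 2))
```

Under those assumptions a worker would complete 96,000 solves a month for roughly $1.04 per thousand, or about 21 cents an hour, which sits at the bottom of the one to three dollars per thousand that the bypass services are said to charge.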

[00:22:35] Gary Clarke: It's certainly not minimum wage.

[00:22:37] Dani Middleton-Wren: No, significantly not. The likelihood of somebody reaching that, like you said, hundred dollar a month ceiling is absolutely, it's not going to happen, is it? Like you say, it's, unless you are under some sort of torturous environment where you are unable to leave your desk and you are only there without sleep, without toilet breaks, without food for 16 hours a day.

And if you're in gainful employment, then I don't think that's going to happen, is it? I think your previous terminology of slave labor is probably the most accurate. And hopefully we will start to see these CAPTCHA farms dissipate. Okay. So let's go on to the technical elements here.

So Gary, do you want to talk about whether CAPTCHA is still useful in determining who is a bot and who is human? So as bots have become more sophisticated, how has CAPTCHA actually adapted to suit the new environment?

[00:23:32] Gary Clarke: I just want to say first that I'm team CAPTCHA, so I'm very much in favour of CAPTCHA. And I think it definitely can differentiate between a human and a bot. I think what's worth thinking about is that people trying to bypass CAPTCHA is not a brand new thing. You know, CAPTCHA has been around since the turn of the century, and just as long as CAPTCHA has been there, people have been trying to outsmart it, you know, using different techniques such as automation, bots, other methods such as CAPTCHA farming. We've just been talking about CAPTCHA farming then, and that's been around since the days of YouTube and MySpace when they first started using CAPTCHA.

So that's been around a very long time too. And there's always going to be people trying to pass CAPTCHA and there's always going to be different iterations of CAPTCHA trying to make it harder for people to do that. What we've seen at Netacea is that CAPTCHA does work on bots, and all you need to do is look at the statistics of the CAPTCHAs served versus the CAPTCHAs passed.

And if you do that, you can see that the CAPTCHA pass rate is extremely low. And the CAPTCHA pass rate in this instance being low is very good 'cause what you're seeing is that a lot of these CAPTCHAs aren't being attempted at all because the bots don't know what to do with them.

However, some bots can, and a lot of bots do, in fact, pass CAPTCHA, so you shouldn't solely rely on CAPTCHA to stop bots. What you should be doing, really, is using it in tandem with other methods, such as Netacea Bot Management Solution, for example. So, what happens there is that even if a bot's passed CAPTCHA, you don't stop monitoring those requests, and then if that bot does start to exhibit bot-like behavior, the machine learning models will pick that up and then you can upgrade that request to a hard block. And what we're seeing though is that the number of false positives is still really, really low. I think out of a million requests, you can count the number of false positives on maybe one or two hands. I think in summary there that although CAPTCHA can be useful in stopping bots, it's wise to use it in conjunction with another blocking strategy for best results.
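The layered approach Gary describes, where a passed CAPTCHA does not end monitoring, might look something like the sketch below. Everything here is hypothetical: the `Session` fields, the `bot_score` value (standing in for whatever a behavioural model would output), and the thresholds are illustrative assumptions, not Netacea's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    captcha_served: bool = False
    captcha_passed: bool = False
    bot_score: float = 0.0  # hypothetical 0..1 output of a behavioural model


def decide(session: Session, challenge_at: float = 0.5, block_at: float = 0.9) -> str:
    """Pick an action; a passed CAPTCHA never exempts a session from a hard block."""
    if session.bot_score >= block_at:
        return "block"
    if session.bot_score >= challenge_at and not session.captcha_passed:
        return "captcha"
    return "allow"


def pass_rate(sessions: List[Session]) -> float:
    """Share of served CAPTCHAs that were passed (0.0 if none were served)."""
    served = [s for s in sessions if s.captcha_served]
    if not served:
        return 0.0
    return sum(s.captcha_passed for s in served) / len(served)


# One session over time: suspicious, then passes the CAPTCHA, then behaves like a bot.
s = Session(bot_score=0.6)
first = decide(s)                        # "captcha"
s.captcha_served, s.captcha_passed = True, True
second = decide(s)                       # "allow"
s.bot_score = 0.95
third = decide(s)                        # "block"

# Three served CAPTCHAs, only one attempted and passed.
rate = pass_rate([s, Session(captcha_served=True), Session(captcha_served=True)])
```

The point of the ordering in `decide` is that the hard block is checked first, so a bot that solves the CAPTCHA and then exhibits bot-like behaviour is still caught. A low value from `pass_rate` corresponds to the served-versus-passed statistic mentioned earlier in the conversation.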

[00:25:25] Dani Middleton-Wren: But why is it that the sophistication of bots today requires that extra level? And what is it that today's bot management solutions, Netacea Bot Management for instance, need to look for to tackle bots in addition to CAPTCHA?

[00:25:41] Gary Clarke: Yeah, so I think with CAPTCHA it basically focuses on one request, so can I pass that CAPTCHA? Can I click this button? Can I select the monkey out of the list of animals, just as an example. But what bot management solutions do is they continue tracking those requests coming from that person, that author, that bot, and then they can build a bigger picture to determine if it's malicious intent. So I suppose with CAPTCHA you're just focusing on one or two maybe very small requests, but bot management focuses on a much bigger user journey.

[00:26:14] Dani Middleton-Wren: So would you say that CAPTCHA still plays a valid role in helping build a wider picture of a single user's journey, and whether they are in fact human or bot?

[00:26:24] Gary Clarke: I'd say yes, because as I mentioned before, we are seeing a lot of bots being served CAPTCHA, and they're just not attempting them because they don't know how to. So I still think it's a very important part of bot mitigation.

[00:26:37] Matthew Gracey-McMinn: It forms an important part of defense in depth. Like Gary says, it stops the simple bots, it stops the less sophisticated ones. That's great. It gets rid of all of those. More sophisticated ones that can get through, you need additional defenses to stop those.

But this is the whole principle of defense going back way, way into the past. You know, you don't just build your walls and then if the enemy gets past them go, well that's that, I guess we give up, sort of thing. You know, you put layers of walls, you put layers of defenses, uh, in your fortified position.

It's the same principle here in cyber security.

[00:27:04] Dani Middleton-Wren: Great. Thank you both for that. That was really helpful. But let's nip on to Elon Musk's perspective here. Because obviously everyone has their opinions about the effectiveness of CAPTCHA and how bots are manipulating the system.

[00:27:19] Chris Collier: So let's have a look at this from just like a factual point of view, right? A user of Twitter posted a link to some additional research with an image of some of the research statistics showing the absurdly fast and incredibly great statistics that you can use ML for solving CAPTCHA and things like that.

And I went and had a read of that research and you're right, it's incredibly in depth and it was really good. So I do urge people to go and have a read of it if you can get your hands on it because it was great. But what's really interesting is obviously... The user wasn't the person that actually published that research in any way, shape or form.

So he was not part of the research team. And Musk is the owner of the platform.

[00:28:01] Dani Middleton-Wren: Hi,

[00:28:01] Chris Collier: So he needs cash to be injected into his business. And he's absolutely going to use basically his platform to push his agenda. And what's really interesting about it though, is when you actually look at the research, the research has used the Alexa top 100 as part of their research, and Twitter, when I checked it yesterday, ranked 7th, so it was absolutely part of that research, whether they call it out and say yes or no, it was absolutely part of that research.

And we're like coming up to close to a year of the Musk era of Twitter now, and I'll be honest. Besides this interesting rebrand that we've got going on and the management reshuffle, I'm not really sure that there is anything that's technically changed over at Twitter right now in order for him to boldly claim that by paying them money your bot problem seems to go away, and what is even more interesting is if you go and have a quick look at their page to talk about X premium, even though he's touting it as a selling point as to why it's really good, it's not listed anywhere on there as a USP as to why you would want to pay them. So he's always had a very interesting stance on bots, hasn't he? Particularly on Twitter, to be fair, let's be honest, in that platform in particular.

But I don't see how paying him more money or paying Twitter money is going to solve the problem. It's one of those situations where you need to have, as Matt and as Gary have said, you have to have layers of defenses at this, right?

So, oh, bots are great at passing CAPTCHA now, that's why you need to pay me more money. There's a lot of other things that I think are... about him pushing that, like, his agenda in that regard, more purely from a financial point of view than for any other reason.

[00:29:46] Dani Middleton-Wren: Yeah. I mean, his tweets, or are they called X's now? I didn't really know. But he stated that past bot defenses are failing, which we know to be true, and only subscription works at scale. But that seems like a very arbitrary comment because he hasn't stated what his subscription service does to enhance bot protection there. He's just then quoted within that tweet the table from the University of California research, which, to your observation there, does kind of imply some ulterior motive. And he also said that telling AI bots apart is becoming increasingly difficult and will soon be impossible.

And the only social networks that will survive will be those that require verification. So perhaps there has been some kind of game plan here all along and we're seeing the fruits of that labor.

[00:30:35] Chris Collier: The guy's not daft, like, say what you want about him and his personal life and the way that he talks and stuff like that, like, the guy's not daft. He isn't daft. He's built tech companies his entire career, basically. And so I can understand why there may be stuff that's going on in the background that we're just not aware of and things like that.

Right. However, you'd think that if he wants to get more people to subscribe and stuff like that, he'd be more transparent about what it is that he's doing to combat this problem on the platform. Like, let's be fair, there's massive brands out there, particularly trainer brands, as an example, from a scalping point of view, that are massively vocal about what it is that they're trying to do to defend their platforms from bots getting hold of things or impersonating people and things like that. Whereas that particular statement, where it's like "past bot defenses are failing and only subscription works at scale", how to say something, but say nothing at the exact same time. You know what I mean? Like it's one of those...

[00:31:35] Dani Middleton-Wren: It doesn't take into consideration Gary's point from earlier that CAPTCHA can be effectively used as part of a wider bot protection strategy. It is just totally wiping it out as an option, which again is very narrow minded.

[00:31:51] Matthew Gracey-McMinn: It almost seems that at times, having watched the sort of Twitter response to bots over the last year, that there's almost that sort of silver bullet, the magic bullet that will solve this problem. And we see that a lot with a lot of companies that we work with as well. There's a sort of, "What is the one thing I can do that will solve this?"

And like any security problem, there isn't one thing. You're dealing with a complex variety of problems and you need that layered defense. Each defense is sort of a counter to a different style of attack. And so long as an attacker is going to get what they want from an attack, they are going to be motivated to try to overcome those defenses. So putting in place subscriptions and so forth, that's great. That will reduce the number of attackers. Because for some, it's like, it's no longer worth the return. The return on investment is not worth me actually, you know, putting in the money, all the time, all the resource to get around that defense. But for some people, it will still be worth it. Unless you start putting ridiculous prices in there. And again, for some people, it might still actually be worth them paying that, and they may have the resources to do it to get the returns that they want. We know that Twitter has been used in the past to spread propaganda and disinformation by people who are interested particularly in American elections, and in trying to influence the outcome of those.

We've got an American election coming next year, so we're very interested to see how bots will act on Twitter, and how much investment nation states in particular, who have largely unlimited funds. We'd be very interested to see how much they are willing to invest in trying to influence these social media platforms, or increasingly trying to stop the influence bots from running on them.

[00:33:23] Dani Middleton-Wren: And particularly now that X is privatized, right?

[00:33:29] Chris Collier: Yeah, it is a privately owned company now. I don't think it's listed anymore since obviously Elon took over it. And I suppose that that's where it then becomes a little bit more interesting, isn't it? Because it's not a case of that you're trying to appease shareholders and stuff like that. I mean, don't get me wrong.

Obviously when you're working in a publicly traded company, there's a lot of stuff that you're obviously trying to do, but the shareholders are also part of that, right? Bit different when your commander in chief is the commander in chief and that's it. There's no shareholders. And if he says, this is what we're going and this is the direction, I don't know.

[00:33:57] Dani Middleton-Wren: Okay, obviously, as we've already discussed today, scraper bots and the tirade of activity and furore they can cause businesses. We mentioned it with skiplagging earlier, but let's have a discussion about this bot attack type as our attack of the month. So Matt, can I start with you to give us a little recap on what scraper bots are?

[00:34:23] Matthew Gracey-McMinn: So, scraper bots are one of the, perhaps the simpler types of bots in many ways, because they're fairly easy to understand. So a scraper bot's job essentially is to extract information from a target site. The goal, essentially what it does is think of it as a sort of almost a spider that goes across a website.

Visits all the pages that may be of interest to it and extracts the relevant information it's after from those pages and reports that back to an originating source. Now, that sounds very simple, but there's a wide variety of applications for that, some of which mean that that is a self contained attack.

That's the entire attack itself. In others, it actually leads to other attacks. So if you look at the self contained attack, if you have a scraper that goes across, say, a media site, say a newspaper site or something that's usually locked behind a paywall, you could extract all of those newspaper articles from there and resell them or just hand them out for free.

Similar things with, say, stock image sites or stock video sites. Any sort of media site that sells content that is on the internet, it could essentially extract from that for resale elsewhere or to offer for free, essentially subverting that organization. In other cases, it may actually be used to inform another attack, so you could use it to identify a particular URL or path that is of particular interest to you.

You could look for a specific thing on a website by extracting all that information. One of the more common uses of it that we see is against the retail sector in scalping attacks. So a scraper will essentially sit on, say, a particular page of an interesting... product, repeatedly making loads and loads and loads of requests very, very rapidly in order to look for the exact moment at which, say, a high demand, low supply item becomes available.

Once the scraper has pulled back that the item is now available and in stock, it will inform a scalper bot, the scalper section of it, to place the item into the basket and check out with it as many times as possible, as quickly as possible. Acquiring, say, 200 of, say, a limited edition doll or a collectible card or something like that, and then push those for sale elsewhere at a marked up price, making a lot of money. We saw this a lot with PlayStation 5s during lockdown and that sort of thing. You see it with trainers a lot at the minute. Those are perhaps the most popular targets at the moment. Rare sneakers, that sort of thing. So the scraper bot essentially is a really versatile bot that's used in loads of different situations, either for a whole attack in itself or to facilitate subsequent attacks.

And as a consequence, I see it as quite a significant problem.
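From the defender's side, the stock-watching behavior described above (one client requesting the same product page again and again, very rapidly) is one of the easier scraper patterns to spot. The following is a minimal sliding-window sketch; the `PollingDetector` class, the window length, and the hit limit are all hypothetical values picked for illustration, not figures from the discussion.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10.0    # hypothetical window length
MAX_HITS_IN_WINDOW = 20  # hypothetical limit per client and path


class PollingDetector:
    """Flags clients that request the same path too often in a short window."""

    def __init__(self, window=WINDOW_SECONDS, max_hits=MAX_HITS_IN_WINDOW):
        self.window = window
        self.max_hits = max_hits
        self._hits = defaultdict(deque)  # (client, path) -> recent timestamps

    def observe(self, client: str, path: str, timestamp: float) -> bool:
        """Record a request; return True if the client is polling too fast."""
        hits = self._hits[(client, path)]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window:
            hits.popleft()
        return len(hits) > self.max_hits


detector = PollingDetector()
# Ten requests per second against one product page.
flags = [detector.observe("client-a", "/product/console", i * 0.1) for i in range(30)]
print(flags.index(True))  # index of the first request that was flagged
```

A scraper that spreads its requests over many IP addresses or accounts would slip past a per-client counter like this one, which is part of why the speakers argue for layered defenses.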

[00:36:56] Dani Middleton-Wren: Yeah, because like you say, it can be deployed for a myriad of different outcomes. It's a really interesting one. And, you know, we've talked a lot today about skiplagging and how scraper bots are used there. Shall we also talk a little bit about whether scraper bots are always bad? They can be used for good, right, Matt?

[00:37:17] Matthew Gracey-McMinn: They can absolutely be used for good. One of the uses actually we saw during, again, the COVID pandemic was monitoring of vaccine sites. There were lots of vaccine sites where you could book for vaccination appointments.

And a lot of people were very desperate to get in there, but they weren't sure when they would be available in their local area and so forth. And it was quite difficult, particularly for those people who perhaps aren't quite so technically adept to try to find these appointments and figure out when they're available, when they're not, how to book them, and so forth.

We actually saw a lot of people creating these bots that would simply scan and essentially alert people, often on Twitter or via email, to say, hey, click this link and go through, book now, there are slots available. And that was a really interesting use of scraper bots that I initially thought was really good, but what we saw, it actually then developed a bit over time as more and more people started doing the same thing.

We ended up seeing some vaccine sites, particularly in the US, where it was quite often done hospital by hospital. These sites weren't built for the amount of traffic that a scraper bot can produce, and some of the scraper bots weren't built very well. So they just make ridiculous amounts of requests to these sites.

One of them would have been fine. Two of them, also fine. 50, 60, 100 of these badly built scraper bots hitting the site suddenly became a problem and brought the site down, making vaccination booking impossible. So we have this really weird sort of interaction where the intent behind this was good, but the actual outcome was in fact the exact opposite of what was intended.

So it became a rather interesting ethical question. As you said, like, you know, initially it was a good thing. I suppose it's like anything, really. You know, the object, the bot itself, is not a bad thing. It's the use of it that is bad. To use Star Wars as an example, lightsabers are neither good nor evil. Both the Jedi and the Sith use them.

[00:39:03] Dani Middleton-Wren: That is a fantastic analogy there, Matt. It means that all of us understood it perfectly. So do you think there is, so, instances like you just described there, Matt, where the bots have been applied for both good and for bad. You've got more people using them. Does that mean that scraper bots have subsequently become more sophisticated? We might think about API scraping in this instance. So what do we know about that? What can people be doing to protect, say, their APIs and understand this evolution of scraper bot attacks?

[00:39:38] Gary Clarke: Okay, so I think first it's important to understand the difference between web scraping and API scraping. So typically what web scrapers do is they look at a website and they get content from that website, and they could use that whichever way they want. They can pull that data, put it on their own website, or they can, they could even do a full clone of the website if they wanted to pull traffic away, and act as like a man in the middle attack for phishing attacks. API scraping is slightly different. They use the API endpoint to get data rather than the content. In both cases, basically, if you want to stop these scrapers, you're going to need some kind of multi layered protection.

Like, as we mentioned before, you could have some kind of CAPTCHA. Now, serving CAPTCHA against an API endpoint is... it's not easy. It used to be nigh-on impossible, but we're getting to a point now where it's, it's very much possible.

[00:40:27] Dani Middleton-Wren: To understand how to protect against scraper bots, do you first need to know they're there? And how does one understand or identify that scraper bots are on their website? Because we know from past experience that there is a risk of a lot of dwell time when it comes to detecting scraper bot activity. It can go under the radar.

And that's why attacks like arbitrage betting, or, you know, that's how they're facilitated, because that scraper bot activity, it's low and slow. It can take you a long time to detect. Being able to identify it, is that the most important thing, do you think, to tackling the problem?

[00:41:02] Gary Clarke: Yeah, I think identifying them is a very important factor, and I suppose identifying them can be quite difficult. I mean, because these scraper bots are essentially pummeling your infrastructure with requests, that can mess with your SEO data, it can screw with analytics, conversion can go right down, and it's only when you start looking at that data that you start asking, why is everyone accessing this certain URL?

Why is everyone trying to go to these particular links, but we're not converting? What is the problem there? So yeah, the first part of that is to understand what these scrapers are and what they're trying to do.
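The analytics check described here, looking for URLs that attract a lot of traffic but almost no conversions, can be sketched as a simple filter over per-path counts. The function name, the input shape (two dictionaries of counts), and both thresholds are assumptions made for the example; real analytics data would need its own extraction step.

```python
def flag_suspect_paths(requests, conversions, min_requests=1000, max_rate=0.001):
    """Return paths with lots of traffic but almost no conversions.

    requests and conversions are dicts mapping URL path -> count.
    The default thresholds are hypothetical and would need tuning per site.
    """
    suspects = []
    for path, count in requests.items():
        if count < min_requests:
            continue  # too little traffic to draw any conclusion
        rate = conversions.get(path, 0) / count
        if rate <= max_rate:
            suspects.append((path, count, rate))
    # Busiest suspects first.
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Made-up numbers purely to show the shape of the input.
requests = {"/product/console": 50000, "/product/socks": 2000, "/about": 10}
conversions = {"/product/console": 3, "/product/socks": 120}
for path, count, rate in flag_suspect_paths(requests, conversions):
    print(path, count, rate)
```

A flagged path is a starting point for investigation rather than proof of scraping, since a page can also convert poorly for ordinary reasons such as an item being out of stock.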

[00:41:37] Dani Middleton-Wren: So as Matt said, it's understanding the outcome. So why might a scraper bot be used on that API?

[00:41:43] Matthew Gracey-McMinn: I think one of the ways you can also approach it is if you know you have an alternative problem, say, you know, if we take gambling as an example, you have people doing arb betting. What do they need to do arb betting? Well, they need to know the odds and they need to know them to the second. How are they doing that so rapidly, so accurately?

Well, they must be getting that information off our site. So I guess they're scraping us. So now you know you have a scraping problem. It's just a matter of finding where the scraping is happening. And as Gary was saying, that depends on the differences of the site and you need to just look at where you're seeing unusual amounts of traffic or unusual behaviors on the site. One of the common things my team actually highlights is we often see companies will have, say, a lot of their content on the site and be like, well, we're not seeing any odd behavior on those URL paths, so I guess we're not being scraped. And then what we're able to often find is, say, an exposed API that actually relinquishes all of the information from the site on a single request.

So suddenly scraping doesn't need to be aggressive. So people often think of scraping as very aggressive. It doesn't need to be, depending on how you've architected, how you've built and set up your site. If you put all of that information on a single page, a single request can get it back. If you put it on an API and allow it all to go out in a single request, attackers will make a single request and get it out.
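One common way to avoid handing everything over in a single response is to cap how much any one API request can return. The sketch below assumes a hypothetical `paginate` helper and a made-up cap of 50 items; it is not taken from the discussion, and it does not stop scraping on its own. What it does is force a scraper to make many requests, which makes the activity visible to the traffic-based detection discussed earlier.

```python
MAX_PAGE_SIZE = 50  # hypothetical server-side cap


def paginate(items, offset=0, limit=MAX_PAGE_SIZE):
    """Return one bounded page of items plus the offset of the next page."""
    offset = max(0, offset)
    # Never trust the caller's limit: clamp it to the server-side cap.
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    page = items[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(items) else None
    return {"items": page, "next_offset": next_offset}


odds = list(range(120))  # stand-in for the full data set behind the API
first = paginate(odds, limit=10000)  # caller asks for everything at once
print(len(first["items"]), first["next_offset"])
```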

[00:43:02] Dani Middleton-Wren: And if that API is unprotected, then you're knackered. So yeah, it's all about recognizing that weakest point, your greatest vulnerability, acknowledging that and being a bit more sophisticated in your thinking to detect those bots and I suppose to your point, Matt, think outside the box.

[00:43:19] Matthew Gracey-McMinn: Yeah, the attackers certainly are. In fact, one of the easier ways we find to actually think outside the box is to go and find the people doing the attacks and see what they're doing, what they're talking about, because they might, they quite often will tell in public forums, or if you talk to them, they might actually tell you how exactly they're doing this, and that can really reveal to you how they're launching that attack.

[00:43:38] Dani Middleton-Wren: Well, yeah, if they've worked really hard to exploit the system, they want to tell people about it.

[00:43:43] Matthew Gracey-McMinn: Exactly.

[00:43:43] Dani Middleton-Wren: Well, thank you all so much for joining us today. We've covered some really interesting topics. I think I could have talked about skiplagging for hours because it is so interesting. You know, we may see that topic develop over time.

So watch this space, hear this space, come back. We might talk about it again. It will be really interesting. Yes, thank you for joining us today. If anybody would like to hear more about what is coming up on Cyber Security Sessions, you can follow us @CyberSecPod. You can also send us questions, find out more about our panelists on each and every episode. You can subscribe and also leave reviews. Thank you all very much. And I look forward to seeing you all again on next month's episode.
