Scrape and Pillage: The Plundering Potential of Data Scraping Bots

RufusInfoSecDeutschland

The Supply Chain Attack risks posed by programmatic data scraping bots are often underestimated despite their significant threat to platforms and their users. These automated tools are designed to mimic human interaction with websites or online systems, allowing them to fly under the radar and avoid detection.

Bots are frequently employed to circumvent official APIs or exfiltrate data unavailable through sanctioned means. Engineered to navigate the complexities of target user interfaces, these bots can stealthily exfiltrate sensitive data, including payroll, orders, tax, and other critical datasets.

The underappreciated nature of this threat vector is precisely why we wish to bring these risks to light and emphasize the importance of proactively addressing the dangers posed by data scraping bots.

Update: Snowflake user accounts have been compromised at a massive scale. Per Mandiant, "Mandiant's investigation has not found any evidence to suggest that unauthorized access to Snowflake customer accounts stemmed from a breach of Snowflake's enterprise environment. Instead, every incident Mandiant responded to associated with this campaign was traced back to compromised customer credentials."

Zenefits: The Stealthy World of Data-Scraping Bots

Zenefits was importing data from ADP accounts for use in the Zenefits application. However, Zenefits users were unaware that bots were scraping the ADP data programmatically. This bot acted as a backdoor, allowing Zenefits to exfiltrate payroll data for mutual clients without following ADP's official integration protocols.  

When ADP discovered that Zenefits was using a programmatic data scraping bot to gain access to ADP's systems, it terminated Zenefits' access. Zenefits' end users were unaware that a backdoor scraping bot was being used to exfiltrate data from ADP accounts, causing confusion, disappointment, and frustration. Zenefits pointed the finger at ADP, and ADP pointed the finger at Zenefits. Social Security numbers, other PII, financial payments, taxes, and payroll were all at risk.

Ultimately, Zenefits employed a data access method they knew to be suspect and against ADP terms.

The Deceptive Dance of Scraping Tools

Let's fast forward to 2024 and switch our gaze to Amazon. Like Zenefits, commercial Amazon software tools use programmatic data scraping bots to exfiltrate data from Seller, Vendor, and Advertising accounts. 

Like ADP, Amazon provides official API pathways for third-party services to request and process account data securely. These APIs rely on industry-standard authentication and authorization protocols, namely Login with Amazon (LWA) and OAuth.

Login with Amazon (LWA) is Amazon's secure authentication system that allows third-party services to verify a user's Amazon account without accessing their password. When a user logs into a third-party service using their Amazon account, LWA validates their credentials and provides the service with a unique user identifier and access token.

OAuth is an open standard for authorization. Users can grant third-party services limited access to their Amazon account data without sharing their direct account login usernames and passwords. The user is redirected to an Amazon login page, where they can review the service request permissions and approve or deny access.
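To make the redirect-and-approve flow concrete, here is a minimal Python sketch of the client side of a generic OAuth 2.0 authorization-code request. It is not Amazon's actual integration: the endpoint, client ID, redirect URI, and scope value are hypothetical placeholders, and a real LWA integration would use the values registered with Amazon. The point it illustrates is that the third-party service never handles the user's password, only a short-lived code returned after the user approves.

```python
import secrets
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical values for illustration only; a real integration would use
# the endpoint, client ID, and redirect URI registered with the platform.
AUTHORIZE_ENDPOINT = "https://auth.example.com/authorize"
CLIENT_ID = "example-client-id"
REDIRECT_URI = "https://app.example.com/callback"


def build_authorization_url(scope: str) -> tuple[str, str]:
    """Return (url, state) for an OAuth 2.0 authorization-code request."""
    # Random value the authorization server echoes back, so the client can
    # tell its own request apart from a forged callback.
    state = secrets.token_urlsafe(16)
    query = urlencode({
        "response_type": "code",
        "client_id": CLIENT_ID,
        "redirect_uri": REDIRECT_URI,
        "scope": scope,
        "state": state,
    })
    return f"{AUTHORIZE_ENDPOINT}?{query}", state


def extract_code(callback_url: str, expected_state: str) -> str:
    """Check the state on the redirect and return the authorization code."""
    params = parse_qs(urlparse(callback_url).query)
    if params.get("state", [""])[0] != expected_state:
        raise ValueError("state mismatch: possible forged callback")
    if "code" not in params:
        raise ValueError("access denied or no authorization code returned")
    return params["code"][0]
```

The service would then exchange the returned code for an access token on its server. Nothing in the request carries a username or password, which is exactly the safeguard a scraping bot with direct account login does not have.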

In contrast, the scraping bots completely bypass these official, secure API pathways. By mimicking human interactions, they sidestep the LWA and OAuth safeguards, gaining direct access to an account. This fundamental breach of the official data access protocols is at the heart of this report's security and compliance issues.

As Zenefits did with ADP, third-party software tools in the Amazon ecosystem do the same. One such company is Reason Automation. How did we stumble across Reason Automation? They actively promote access to data sets Amazon does not make available via API, meaning their only method of access to the data is programmatic account scraping bots. Reason Automation also confirmed using Google Puppeteer to scrape account data programmatically.

Like Zenefits customers who were unaware of how they accessed ADP data, Reason Automation customers are unaware that the service they are paying for uses data scraping bots.

As it was for Zenefits, the lack of candor is a “lie of omission,” as prospects and customers are often unaware companies employ these data scraping bots.

When Bots Go Bad: The Dark Side of Data Scraping

First, programmatically scraping account data violates Amazon's terms and conditions, full stop. A company offering software services knows it violates Amazon's terms when it makes that choice.

Second, using programmatic account data scraping bots within Amazon's Advertising, Seller, and Vendor ecosystem has security implications. Data scraping bots expand the TTPs (tactics, techniques, and procedures) available to a threat actor while offering few IOCs (indicators of compromise) before any malicious event.

Most companies, like ADP and Amazon, prohibit programmatic scraping of account data because they understand the significant security threats these methods pose:

  • Data scraping bots bypass official APIs and access control mechanisms, which are fundamental to security. Bots can access and scrape sensitive data, exposing customer, financial, and other proprietary business information.
  • Bots do not comply with approved programmatic security controls that govern third-party services.
  • There is no external review, auditing, or monitoring of access patterns and activities used by bots within the user account environment. Programmatic account scraping bots exist outside established protocols, such as Amazon's RDT certification process, thus escaping scrutiny. 
  • Bot providers "self-certify" that everything they do is secure despite the absence of any external oversight. As such, no controls govern the validation of the programmatic account scraping bot's code integrity or operational security.
  • Bots have unauthorized data access to user personal information, which creates regulatory risks.

Bot-tom Line: A Master Class in Supply Chain Attacks

In 2022, LastPass, a company focused on security, suffered a data breach. As a password manager, LastPass was a valuable target due to its possession of account credentials for thousands of users. A threat actor targeted a LastPass DevOps engineer, implanting keylogger malware to capture the engineer's master password and gain access to the engineer's LastPass corporate vault.

The compromised data included system configuration data, API secrets, third-party integration secrets, and both encrypted and unencrypted LastPass customer data.

Companies that engage in account data scraping are particularly vulnerable to these types of Supply Chain Attacks, as they have access to many Seller, Vendor, and Advertising accounts.

Like the LastPass incident, if a threat actor compromises a data scraping software provider, they can access hundreds or thousands of business accounts. Once the threat actor secures access to these accounts, they also gain access to the personal information of Amazon customers.

How hard can it be: Exposing Threats

Tools like Google Puppeteer fall into the class of headless browsers, which automate browser actions for tasks such as web scraping. Other similar tools include Selenium, PhantomJS, and Playwright.
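As a rough illustration of why these tools are hard to spot, the Python sketch below checks a request's User-Agent header for markers that some automation tools have advertised by default, such as "HeadlessChrome" or "PhantomJS". The marker list and function name are our own assumptions for illustration, not a vetted detection rule. Because a bot can override its User-Agent with one line of configuration, a check like this catches only the least careful automation.

```python
# Substrings that some automation tools have advertised in their default
# User-Agent. Assumed list for illustration; a match is a weak signal only,
# and the absence of a match proves nothing.
AUTOMATION_UA_MARKERS = ("HeadlessChrome", "PhantomJS")


def looks_automated(user_agent: str) -> bool:
    """Return True if the User-Agent carries a known automation marker."""
    ua = (user_agent or "").lower()
    return any(marker.lower() in ua for marker in AUTOMATION_UA_MARKERS)
```

A bot that presents an ordinary desktop Chrome User-Agent passes this check untouched, which is the "fly under the radar" behavior this report describes.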

A proof of concept was constructed using Google Puppeteer to demonstrate the real-world risks associated with programmatic data access using scraping bots. Why select Google Puppeteer? Reason Automation uses it. (Note: Puppeteer is based on Chrome, and every Chrome exploit found is an exploit that presents risks in Puppeteer.)

The purpose is to showcase how bots can be automated to achieve two malicious objectives:

  • Exfiltration of personally identifiable information (PII) data
  • Committing financial fraud

To simplify the process and mimic Reason Automation's approach, we followed the documentation provided by AWS on deploying Google Puppeteer on AWS Lambda using container image support. This allowed us to create a scalable and efficient bot infrastructure quickly.

Next, using Reason Automation public documentation for user account access methodology, the bot was granted similar access to an Amazon account. 

The following sections detail how our proof-of-concept bot achieved these malicious objectives, highlighting the severe security risks of programmatic data scraping bots.

Scrape and See: Bots on the Prowl for PII Access

Amazon has strict enforcement of PII access via their API. Ask a software developer trying to get approval for PII data access in the Selling Partner API. If a software developer is approved for PII data access, they undergo intensive audits, monitoring, and review. Developers must also adhere to strict protocols around data access controls and governance. The controls and governance apply to PII data's access, processing, and storage.

However, programmatic data scraping bots have no review or oversight despite exposing a direct pathway to access PII data. They completely bypass Amazon's rigorous oversight, review, and approval.

By replicating Reason Automation's bot approach with Google Puppeteer, we were able to exfiltrate Amazon PII data. Scraping PII order data is a relatively straightforward process for the bot. The bot is instructed to take a screenshot of what it sees within order transactions, closely mimicking what a human would do.

Bot Screen capture of PII Data

Next, the bot is instructed to navigate to each order transaction methodically. The bot visits each order transaction detail page, takes a screenshot, and then parses the data.

In this case, the PII data is the customer's name, address, phone, shipping information, and order details.

Bot Screen Capture of Customer PII Order Data


With the order page rendered, the bot captures and saves the PII order data. Below is actual PII data scraped into CSV format, which is then exfiltrated to an external database:

  1. Order ID: 1XXX-XXXXXXX-XXXXXX45
  2. Ship By: 2024-XX-XX to 2024-XX-XX
  3. Deliver By: 2024-XX-XX to 2024-XX-XX
  4. Purchase Date: 2024-XX-XX X:27 PM PDT
  5. Shipping Service: Standard
  6. Carrier: USPS
  7. Shipping Service Details: XXXXXXX
  8. Ship To Name: XXXXX XXXght
  9. Ship To Address: XXXX XXXX XXX, XXXXX, MO XXXXX-XXXX
  10. Contact Buyer: XXXXX
  11. Phone: +1 XXX-XXX-XXXX ext. XXX70
  12. Items Total: XX.78
  13. Tax Total: X.65
  14. Grand Total: XX.43
  15. Product Name: XXXXXX Outdoor Sandal XXXX XXXX
  16. ASIN: XXXXX
  17. SKU: XXXXXXXXX
  18. Condition: New
  19. Order Item ID: XXXXXXX
  20. Quantity: 1
  21. Unit Price: XX.78
  22. Proceeds: XX.78

If there are 10, 100, 1000, or 10,000 orders, the exposure of this data is non-trivial.
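Bulk access of this kind leaves one of the few available indicators: a single session opening order detail pages far faster than a person would. The Python sketch below is a minimal sliding-window counter that flags such a session. The class name and the thresholds (30 views in 60 seconds) are hypothetical values chosen for illustration, not figures from Amazon or from our testing, and a patient bot can throttle itself below any fixed limit.

```python
from collections import deque


class OrderViewMonitor:
    """Flag a session that opens too many order detail pages in a short window."""

    def __init__(self, max_views: int = 30, window_seconds: float = 60.0):
        # Hypothetical thresholds; tune against real, legitimate usage.
        self.max_views = max_views
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_view(self, timestamp: float) -> bool:
        """Record one page view; return True if the session exceeds the limit."""
        self._timestamps.append(timestamp)
        # Drop views that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_views
```

Each call to `record_view` takes the time of the page view in seconds and reports whether the session has crossed the threshold within the current window.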

Reason Automation says not to worry: it does not collect personally identifiable information (PII) through scraping activities. They don’t seem to understand that they are opening a door. The key issue here rests in account access. The access provides a path for a threat actor to collect and store data.  The fact that they might not do something (i.e., store data) does not mitigate the security risks associated with their operations.

Juicing the System: Why the Scrape is Worth the Squeeze

The phrase "the juice is not worth the squeeze" implies that the effort, time, or resources required to achieve something are not justified by the outcome or rewards. In other words, the results or benefits are not worth the work or trouble involved. In the case of using the bot's account access to exfiltrate a financial windfall, this juice is worth about five million squeezes.

With a Supply Chain Attack, a focused threat actor can scrape financials to determine how much money is transacted for a seller. 


Targeting "Whale" Accounts


Once a high-value target (a whale!) is found, the threat actor automates the multi-step workflow of adding a new bank account. 


Adding Alternate Banking Information

Attaching a bank account has a 3-5-day latency, even with the automation employed. However, once complete, a threat actor has laid a foundation to exfiltrate payments from Amazon to a seller. Three accounts are now attached to the Seller. Which bank account was added by the bot?

Alternate Banking Accounts


While this process is more complex, having programmatic access opens the door. As in the case of the LastPass breach, the juice is worth the squeeze for a threat actor focused on financial fraud.
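The question of which bank account the bot added is only answerable if the seller keeps a baseline to compare against. As a minimal sketch of that idea, the Python function below compares the deposit methods currently attached to an account with a list the seller has explicitly approved. The identifiers and function name are hypothetical, and the sketch assumes the seller can export the current list by some sanctioned means.

```python
def find_unexpected_deposit_methods(approved: set[str], current: set[str]) -> set[str]:
    """Return deposit-method identifiers attached now but never approved."""
    return current - approved


# Hypothetical identifiers for illustration.
approved = {"bank-ending-1234", "bank-ending-5678"}
current = {"bank-ending-1234", "bank-ending-5678", "bank-ending-9012"}
print(sorted(find_unexpected_deposit_methods(approved, current)))
# prints ['bank-ending-9012']
```

Given the 3-5-day latency noted above, running a comparison like this daily would surface an unexpected account before it can receive a disbursement, but only for sellers who know to look.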

Note: We kept this automation workflow high-level due to the severity of detailing it. However, most of the information on how to perform these steps is in the public domain.

Behind the Scenes of Data Scraping: How Scraping Bots Sneak Past Security

How do bots gain access to accounts to perform these operations? Programmatic account access via bots exposes multiple vectors of attack. Rather than going into depth on possible internal threats and AWS resource mismanagement, such as Puppeteer running on AWS Lambda, misconfigured AWS S3 buckets, KMS, and other services, we focus on low-hanging fruit that is immediately exploitable: spear phishing.

Phishing Phrenzy: Crafting the Perfect Deception

How would a threat actor exploit the low-hanging fruit to execute something similar to the two use cases we showed previously? Below is a phishing email of the kind used in spear-phishing or whaling attacks to reach individuals with administrative access to Amazon systems.

Thankfully, Reason Automation has documented everything we need to get started.

Bot user account instructions

Programmatic bot access to Amazon customer accounts, including access by third-party commercial services like Reason Automation, relies on a highly reproducible pattern for access credentials, such as client-[brandname]-[marketplace]@reasonautomation.com.
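The same predictability can be turned around for defense. The Python sketch below lets an account owner scan an exported list of user emails for addresses that follow the client-[brandname]-[marketplace]@domain shape, so they can see which bot-style users hold access. The allowed characters in each part of the regular expression are our assumption for illustration, since the documented pattern does not specify them, and the sample addresses are made up.

```python
import re

# Mirrors the documented shape client-[brandname]-[marketplace]@domain.
# The allowed characters in each part are an assumption for illustration.
BOT_USER_PATTERN = re.compile(
    r"^client-(?P<brand>[a-z0-9-]+)-(?P<marketplace>[a-z0-9]+)@(?P<domain>[a-z0-9.-]+)$",
    re.IGNORECASE,
)


def find_bot_style_users(user_emails):
    """Return (email, brand, marketplace, domain) for addresses matching the pattern."""
    matches = []
    for email in user_emails:
        candidate = email.strip()
        m = BOT_USER_PATTERN.match(candidate)
        if m:
            matches.append((candidate, m["brand"], m["marketplace"], m["domain"]))
    return matches
```

For example, given `["owner@example.com", "client-acme-us@vendor.example"]`, only the second address is returned, split into its brand, marketplace, and domain parts. Any match is a prompt to review what that user can access, not proof of wrongdoing.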

Consistency and repeatability give threat actors an exploitable pathway for a coordinated attack. For example, understanding the structure and communication of bot access makes it easy to create spear phishing emails:


Spear phishing email


Note the threat actors' alternate domain names and sense of urgency in creating a perception of authenticity.  
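Alternate domain names of this kind can often be caught mechanically. As a minimal sketch, the Python function below compares a sender's domain with a list of domains the recipient trusts and flags near matches using the standard library's `difflib.SequenceMatcher`. The 0.8 similarity threshold and the sample domains are assumptions for illustration. A production mail filter would also need to handle homoglyphs, subdomains, and display-name spoofing, which this sketch ignores.

```python
from difflib import SequenceMatcher


def classify_sender_domain(sender: str, trusted_domains: set[str],
                           threshold: float = 0.8) -> str:
    """Return 'trusted', 'lookalike', or 'unknown' for an email sender address."""
    domain = sender.rsplit("@", 1)[-1].strip().lower()
    if domain in trusted_domains:
        return "trusted"
    for trusted in trusted_domains:
        # Similar but not identical to a trusted domain: the risky case.
        if SequenceMatcher(None, domain, trusted).ratio() >= threshold:
            return "lookalike"
    return "unknown"
```

With `{"example.com"}` as the trusted set, `billing@example.com` is classified as trusted, while `billing@examp1e.com` is classified as a lookalike and deserves the closest scrutiny.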

A variation on a spear phishing email that would target a customer and trick them into providing escalated account access:


Spear phishing + fake account portal

The spear phishing email then directs the target to a fake update portal.

Fake account management portal


The fake account portal shown in the screenshot allows a threat actor to perform several malicious actions. By setting up a portal that mimics an official-looking admin panel, threat actors can trick the user into providing their credentials and a two-step verification code.

This grants the attacker full access to financial and payment controls, allowing them to change payment details, redirect funds, add new fake users with admin roles, modify account settings, and lock out legitimate users.

This is possible because bot software tools create a consistent, structured, predictable pattern for activating bot accounts. The inherent risk introduced by using bots complicates security measures for everyone involved. Protecting against sophisticated spear-phishing efforts is challenging enough, but it becomes even more difficult when the software providers are creating experiences and expectations that allow threat actors to leverage a documented pattern. They can use documented processes to create fear, uncertainty, and doubt (FUD) in the minds of spear-phishing targets.

Bot-ched Compliance: Failing GDPR with Scraping Bots

Most data scraping bot providers' positions regarding their data scraping activities indicate a misunderstanding of, or disregard for, their responsibilities under data protection laws such as the General Data Protection Regulation (GDPR).

In the eyes of GDPR, a system's capability to access sensitive data, irrespective of whether it stores that data, makes the system operator, like Reason Automation, a data processor. This classification carries specific obligations they seem unaware of. Notwithstanding the choice of using a data access solution that violates Amazon’s terms, their potential lack of awareness or disregard for GDPR obligations suggests a critical gap in their basic understanding of their obligations as a data processor.

The data scraping providers' oversight or misunderstanding of their responsibilities under GDPR is a significant concern, and it highlights a broader issue for any seller, vendor, agency, or business that engages in their services.  For example, Reason Automation seems unaware of its obligations, given that its data processing is a pathway to access PII data.

Under GDPR, data processors are held accountable, and data controllers—the entities on whose behalf the data is processed—are also responsible for ensuring that their processors comply with the relevant data protection laws.

When a business hires a data scraping software tool (knowingly or unknowingly), it becomes the data controller and must, therefore, ensure that the company it hired to process data understands its responsibilities under GDPR.

Failure to do so can result in direct liability for the controller. Should a breach occur that opens a pathway for PII data to be exfiltrated, both the data scraper and the hiring business are liable.

The Bot Stops Here: Mitigating Programmatic Scraping Threats

Reason Automation’s approach of using Google Puppeteer, like other companies using similar account scraping bots to mimic human browser interactions, allows a software tool to completely bypass Amazon's security controls, access protocols, and governance mechanisms.

They operate in a blind spot, where Amazon has no visibility or control of programmatic account access. Using bots in this manner ensures a continued lack of transparency and an increased surface area for threat actors to exploit. Allowing any software provider into your account for data scraping should be extremely concerning to anyone who operates within Amazon's platform.

We only covered a small subset of actions a threat actor may take. The risk is significant if a company like Reason Automation is compromised via misconfigured AWS environments or if an employee's computer is compromised, as in the LastPass breach.

As Mandiant stated in the Snowflake incident, "Contractors that customers engage to assist with their use of Snowflake may utilize personal and/or non-monitored laptops that exacerbate this initial entry vector. These devices, often used to access the systems of multiple organizations, present a significant risk. If compromised by infostealer malware, a single contractor's laptop can facilitate threat actor access across multiple organizations, often with IT and administrator-level privileges. "

A threat actor could exfiltrate hundreds of direct, unmonitored logins to Amazon Seller, Vendor, and Advertiser accounts.

Given the lack of transparency and accountability of data scraping bots, Amazon and most other platforms prohibit programmatic data scraping of accounts. Allowing third parties to operate and offer commercial software services that are a threat actor's best friend creates an unacceptable risk for Amazon and its customers.

Amazon must take decisive action against programmatic data scraping to uphold its responsibility to ensure its platform's trust, transparency, and safety meet the highest standards.

These software companies publicly operate and promote data scraping from Amazon's platform without consequence. Only Amazon can stop this practice by holding third parties accountable, blocking access to those engaging in unauthorized data collection, notifying sellers, vendors, and advertisers of the risks posed by these activities, and consistently enforcing strict policies against this behavior.

Amazon needs to take proactive steps to protect its ecosystem and the interests of those operating on its platform.

-------------------

Disclaimer

Responsible Disclosure Efforts

Despite multiple attempts to engage with software providers, we have not received a satisfactory response or resolution. On the contrary, many providers, including Reason Automation, see no issue with their data scraping despite acknowledging that Amazon does not permit it. As a result, we are disclosing these findings in a manner that balances transparency with security, ensuring no harm is done to any seller, vendor, advertiser account, or customer.

No Harm to Accounts or Customers

This report does not disclose specific code or low-level technical details that could enable a threat actor to replicate the exploit easily. The investigation and testing were conducted in a controlled environment, with strict adherence to ethical standards, ensuring that no actual seller, vendor, or advertiser accounts or their customers were compromised or harmed in any way.

Limited Disclosure of Technical Details

The report avoids disclosing specific code or detailed technical methodologies that could facilitate malicious actors' reproduction of the exploit. Instead, it focuses on the broader implications and potential security risks associated with the identified vulnerabilities.

No Financial Remuneration Sought or Expected

It is important to note that throughout this research and disclosure process, no payment, bounty, or any other form of financial remuneration was ever requested, sought, or expected. The primary motivation behind this report is to raise awareness about the security and compliance risks associated with data scraping bots and to encourage better practices within the Amazon ecosystem for the benefit of all stakeholders.
