
    Confidential Information Of 1.2 Billion People Discovered In Massive Data Leak



    On October 16, 2019, Bob Diachenko and Vinny Troia discovered a wide-open Elasticsearch server containing an unprecedented 4 billion user accounts spanning more than 4 terabytes of data.


    The total count of unique people across all data sets exceeded 1.2 billion, making this one of the largest data leaks from a single source organization in history. The leaked data contained names, email addresses, phone numbers, and LinkedIn and Facebook profile information.


    What makes this data leak unique is that it contains data sets that appear to originate from 2 different data enrichment companies.



    How Does Data Enrichment Work?



    For a very low price, data enrichment companies allow you to take a single piece of information on a person (such as a name or email address) and expand (or enrich) that user profile to include hundreds of additional data points. As seen with the Exactis data breach, collected information on a single person can include household size, finances and income, political and religious preferences, and even a person’s preferred social activities.


    Each time a company chooses to “enrich” a user profile, they are also agreeing to provide what they know about the person to the enriching organization (thereby increasing the validity of the organization’s future results). Despite efforts from social media organizations like Facebook, the resulting data continues to be compounded, creating a situation with no oversight that ultimately allows all of a person’s social and personal information to be easily downloaded.
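
    The two-way flow described above can be illustrated with a toy model. This is only a sketch: the EnrichmentStore class, its field names, and the example addresses are hypothetical, and real enrichment services are far more elaborate. It shows how each lookup both returns the accumulated profile and adds the caller's own data to it.

```python
from dataclasses import dataclass, field


@dataclass
class EnrichmentStore:
    """Toy in-memory profile store keyed by email address (hypothetical)."""
    profiles: dict = field(default_factory=dict)

    def enrich(self, email: str, known: dict) -> dict:
        """Merge the caller's data into the stored profile and return the result."""
        key = email.strip().lower()
        profile = self.profiles.setdefault(key, {"email": key})
        # The caller's fields flow into the store first ...
        for name, value in known.items():
            profile.setdefault(name, value)
        # ... and the caller receives everything accumulated so far.
        return dict(profile)


store = EnrichmentStore()
store.enrich("jane@example.com", {"name": "Jane Doe"})    # customer A contributes a name
store.enrich("Jane@Example.com", {"phone": "555-0100"})   # customer B contributes a phone
result = store.enrich("jane@example.com", {})             # customer C receives both
```

    In this sketch the third caller contributes nothing and still receives the name and phone number supplied by the first two, which is the compounding effect the paragraph describes.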



    The Open Elasticsearch Server



    The discovered Elasticsearch server containing all of the information was unprotected and accessible via web browser at [address omitted]. No password or authentication of any kind was needed to access or download all of the data.


    Elasticsearch stores its information in an index, which is similar to a type of database. The following is a screenshot of the different indexes (databases) available on the discovered server.





    The majority of the data spanned four separate indexes, labelled “PDL” and “OXY”, with information on roughly 1 billion people per index. Each user record within the databases was labelled with a “source” field that matched either PDL or Oxy.
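
    For administrators who want to see what their own cluster exposes, index names and document counts can be listed through Elasticsearch's _cat/indices endpoint. The sketch below assumes a cluster you administer; the base URL passed to fetch_indices, the helper names, and the sample response are made up for illustration and are not data from the server described above.

```python
import json
from urllib.request import urlopen


def fetch_indices(base_url: str) -> str:
    """Return the raw JSON text of GET /_cat/indices for a cluster you administer."""
    with urlopen(f"{base_url}/_cat/indices?format=json", timeout=10) as resp:
        return resp.read().decode("utf-8")


def summarize_indices(cat_json: str) -> list[tuple[str, int]]:
    """Parse _cat/indices JSON into (index name, document count), largest first."""
    rows = json.loads(cat_json)
    # docs.count arrives as a string and can be null for closed indexes
    summary = [(row["index"], int(row.get("docs.count") or 0)) for row in rows]
    return sorted(summary, key=lambda item: item[1], reverse=True)


# Made-up sample response, used here instead of a live request.
sample = json.dumps([
    {"index": "index_a", "docs.count": "120"},
    {"index": "index_b", "docs.count": "3500"},
])
summary = summarize_indices(sample)
```

    If fetch_indices succeeds against your own cluster without any credentials, the cluster is readable by anyone who can reach it over the network, which is the condition this article describes.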



    Company 1: People Data Labs (PDL)



    Based on our analysis of the data, we believe the data in the PDL indexes originated from People Data Labs, a data aggregator and enrichment company.


    De-duplicating the nearly 3 billion PDL user records revealed roughly 1.2 billion unique people and 650 million unique email addresses, which is in line with the statistics provided on their website. The data within the three different PDL indexes also varied slightly: some focused on scraped LinkedIn information, email addresses, and phone numbers, while others provided information on individual social media profiles such as a person’s Facebook, Twitter, and GitHub URLs.
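
    The article does not say how the records were de-duplicated. One common approach, sketched here purely as an assumption, is to treat two records as the same person whenever they share a normalized email address and to group them with a union-find structure. The record layout (an "emails" list per record) and the count_unique helper are hypothetical.

```python
def count_unique(records: list[dict]) -> tuple[int, int]:
    """Return (unique people, unique emails).

    Assumption: records sharing any normalized email belong to one person;
    records with no email are counted as separate people.
    """
    parent = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    emails = set()
    for i, rec in enumerate(records):
        node = ("rec", i)
        parent[node] = node
        for raw in rec.get("emails", []):
            email = raw.strip().lower()
            emails.add(email)
            enode = ("email", email)
            parent.setdefault(enode, enode)
            union(node, enode)  # link the record to each of its emails

    people = {find(("rec", i)) for i in range(len(records))}
    return len(people), len(emails)


records = [
    {"name": "Jane Doe", "emails": ["jane@example.com"]},
    {"name": "J. Doe", "emails": ["JANE@example.com", "jd@example.org"]},
    {"name": "Jane D", "emails": ["jd@example.org"]},
    {"name": "John Roe", "emails": ["john@example.com"]},
    {"name": "No Email", "emails": []},
]
unique_people, unique_emails = count_unique(records)
```

    The first three sample records collapse into one person because they are chained together by shared addresses, even though the first and third have no email in common.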


    According to their website, the PDL application can be used to search:


    Over 1.5 billion unique people, including close to 260 million in the US.

    Over 1 billion personal email addresses. Work email addresses for more than 70% of decision makers in the US, UK, and Canada.

    Over 420 million LinkedIn URLs.

    Over 1 billion Facebook URLs and IDs.

    400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.



    Attribution To PDL



    After notifying PDL, we were informed that the server in question does not belong to them. This is consistent with our research, as the server in question resided on Google Cloud, while the PDL API appears to use Amazon Web Services.


    In order to test whether or not the data belonged to PDL, we created a free account on their website which provides users with 1,000 free people lookups per month.



    Almost 100% Data Match



    The data discovered on the open Elasticsearch server was almost a complete match to the data being returned by the People Data Labs API. The only difference was that the data returned by the PDL API also contained education histories, whereas none of the data downloaded from the server included education information. Everything else was exactly the same, including accounts with multiple email addresses and multiple phone numbers.


    To confirm, we randomly tested 50 other users and the results were always consistent.
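
    A comparison of this kind can be sketched as a field-by-field diff over a random sample. The code below is an illustration, not the procedure actually used: the record layout, the field names, and the diff_records and sample_keys helpers are all hypothetical.

```python
import random


def diff_records(leaked: dict, api: dict, ignore: frozenset = frozenset()) -> dict:
    """Return {field: (leaked_value, api_value)} for every field that differs."""
    diffs = {}
    for key in (leaked.keys() | api.keys()) - ignore:
        a, b = leaked.get(key), api.get(key)
        # Compare multi-valued fields (emails, phones) without regard to order
        if isinstance(a, list) and isinstance(b, list):
            a, b = sorted(a), sorted(b)
        if a != b:
            diffs[key] = (leaked.get(key), api.get(key))
    return diffs


def sample_keys(keys: list, k: int = 50, seed: int = 0) -> list:
    """Pick up to k keys at random; seeded so the check can be repeated."""
    rng = random.Random(seed)
    return rng.sample(keys, min(k, len(keys)))


leaked = {"name": "Jane Doe",
          "emails": ["a@example.com", "b@example.com"],
          "phones": ["555-0100"]}
api = {"name": "Jane Doe",
       "emails": ["b@example.com", "a@example.com"],
       "phones": ["555-0100"],
       "education": ["Example University"]}

all_diffs = diff_records(leaked, api)
core_diffs = diff_records(leaked, api, ignore=frozenset({"education"}))
```

    With the made-up records above, the only reported difference is the education field that exists on the API side alone; ignoring that field leaves no differences at all.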



    An Interesting and Unique Match



    One of the phone numbers returned for my profile was 1-636-825-2744. I did not remember ever having this phone number, so I decided to look into it. Roughly 10 years ago I was given a landline as part of an AT&T TV bundle. The landline was never used and the number was never given to anyone. I never actually owned a phone, yet somehow this information appears in my profile.


    When I checked my account on PeopleDataLabs.com, the returned results were identical – including that phone number.

    Since I have never seen this phone number appear in any of my previously breached/leaked records, this is a very good indication that the leaked database originated from PDL.



    Company 2: OxyData.io (OXY)



    After some basic sleuthing, I came across OxyData.io, another data enrichment company. OxyData’s website claims to have 4TB of user data (exactly the amount discovered), but only 380 million people profiles.



    OxyData Analysis



    Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.
    Upon contacting OxyData, I was also informed that the server did not belong to them. Oxy was not willing to give me access to their API to test or compare profiles, but they were nice enough to send me a copy of my own record for analysis. The data they sent consisted mostly of scraped LinkedIn profile information and appears to match the data found on the server.



    Who Is Accountable?



    This is an incredibly tricky and unusual situation. The lion’s share of the data is marked as “PDL”, indicating that it originated from People Data Labs. However, as far as we can tell, the server that leaked the data is not associated with PDL. This raises a number of other questions. First, how did this mystery organization get the data? Are they a current or former customer? If so, the data discovered on the server indicates that this company is a customer of both People Data Labs and OxyData.


    If this was a customer that had normal access to PDL’s data, then it would indicate the data was not actually “stolen”, but rather misused. This unfortunately does not ease the troubles of any of the 1.2 billion people who had their information exposed.


    If this was not a breach, then who is accountable for this exposure?



    The Problem With Attribution



    Identification of exposed, nameless servers is one of the most difficult parts of an investigation. In this case, all we can tell from the IP address is that it is (or was) hosted with Google Cloud.


    Because of obvious privacy concerns, cloud providers will not share any information on their customers, making this a dead end.


    Agencies like the FBI can request this information through legal process (a type of official Government request), but they have no authority to force the identified organization to disclose the breach.


    One could argue that because PDL’s data was misused, it is up to them to notify their customers. One could also argue that the owner of the server is responsible and liable for any potential damages. But legally, we have no way of knowing who that is without a court order.


    Due to the sheer amount of personal information included, combined with the complexity of identifying the data owner, this has the potential to raise questions about the effectiveness of our current privacy and breach notification laws.



    About Data Viper



    Data Viper is a next-generation threat intelligence platform, providing organizations, investigators, and law enforcement with the ability to search across thousands of data breaches, with full historical visibility into private, deep, and dark web hacker channels, pastes, and forums. Data Viper is designed for both brand monitoring and threat actor intelligence research. For more information on how we can help in data breach and cyber-criminal investigations, please contact us.



    Source: dataviper.blog