The Global Legal Entity Identifier (GLEI) dataset was collected from various sources for the Global LEI watch project. "LEIs are designed to be a single, universal standard identifier for any organisation or firm involved in a financial transaction internationally". The dataset contains organisations and companies in different locations, together with their corresponding identifiers.

Here's what the schema for the GLEI dataset looks like:

LEI, RegistryNumber, LegalName, LegalForm, City, State, Country, PostCode, Address, Latitude, Longitude, LastUpdateDate, PortalDate, Source

The dataset currently contains 312,295 tuples.

Possible leads on datasets for integration (copied from Peter's email with little modification): The National Information Center (http://www.ffiec.gov/nicpubweb/nicweb/NicHome.aspx) contains ownership hierarchy information for American financial institutions. You’ll have to write a script that issues a GET request using the institution's RSSD ID (the unique identifier assigned by the Federal Reserve), then POSTs the form details, and then parses the resulting HTML. Alternatively, you could file a freedom of information request. Furthermore, I don’t know of any dataset that connects RSSD IDs to LEIs, so you’ll have to connect them using the legal name (and hope that there are no variations in name!).
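
As a rough illustration of the "GET, then POST the form details" step, the sketch below collects the hidden fields of a form so they can be sent back with the POST. It uses only the Python standard library and runs on a made-up HTML fragment. The field names (token, rssd_id) and the sample markup are assumptions for illustration only, since the actual structure of the NIC pages is not described here.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)  # attribute values may be None, hence the "or" guards
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


# Made-up page fragment; the real markup has to be inspected first.
SAMPLE_PAGE = """
<form method="post" action="search">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="rssd_id" value="">
  <input type="submit" value="Search">
</form>
"""

form_data = hidden_fields(SAMPLE_PAGE)
form_data["rssd_id"] = "0000000"  # placeholder, not a real RSSD ID
print(form_data)
```

In a real script, the page would first be fetched (for example with urllib.request.urlopen), and form_data would then be encoded with urllib.parse.urlencode and sent back as the POST body. Which fields the site actually expects is something to verify against the live page.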

The International Consortium of Investigative Journalists has a map detailing the leaks about offshore tax havens. You could try to get their database and integrate it. They may be more willing to just send it to you if you ask. Check it out: http://offshoreleaks.icij.org/search

Now the biggest challenge is that these datasets do not use the LEI, which makes integration difficult. The student who takes on this project will have to work with the LegalName field. Exact matching may yield little overlap (~200 tuples), so the student could look into research on entity resolution and data integration for help. The variation in names across datasets could be simple or complex. What we are looking for is not full integration of the two datasets, but results that allow us to use them for future research.
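
To give a feel for what matching on LegalName beyond exact equality might look like, here is a minimal sketch that normalises names and then scores them with a string similarity measure from the Python standard library. It is only a starting point, not the entity resolution approach the project prescribes. The suffix list, the 0.9 threshold, and the company names are all assumptions made up for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher

# Assumed list of legal-form words to ignore; extend it for the data at hand.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co",
                  "company", "ltd", "limited", "llc", "plc"}


def normalize(name):
    """Lower-case, strip accents and punctuation, drop trailing legal-form words."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalisation."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(name, candidates, threshold=0.9):
    """Return (candidate, score) for the closest candidate, or None if below threshold."""
    scored = [(c, similarity(name, c)) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Fictitious names, for illustration only.
print(normalize("Acme Holdings, Inc."))
print(best_match("Acme Holdings Ltd", ["Acme Holding Limited", "Zenith Bank"]))
```

Comparing every LegalName against every candidate is quadratic, so on 312,295 tuples some form of blocking (for example, only comparing names within the same Country) would likely be needed. The entity resolution literature covers this and more robust similarity measures.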