Challenge submissions

The evaluation server has been closed. However, the leaderboards are still being maintained. We only accept submissions with processed results: please evaluate your method using our code and the ground-truth data, and upload a zip file containing the processed JSON in the packed_test folder and the JPEGs in benchmark-visualization/jpg.

The submission website is password-protected to prevent abuse. Please contact the organizers at image-matching@googlegroups.com for the password (please account for short delays in answering and uploading close to the deadline). Please upload the results as a zip or tarball containing the JSON file and your features/matches, if applicable. You can also check the status of your submission via the status tracking spreadsheet.

Please always run our validation script to ensure your submission is in the proper format. We also have a general tutorial on how to use our benchmark and create a submission file, as well as a tutorial specific to custom matchers; please have a look at them if you have trouble creating your submission.

Challenge categories

Submissions are broken down into two categories by the number of keypoints: we consider a "restricted" budget of 2048 features and an "unlimited" budget (capped at 8000 features per image for practical reasons). In previous editions we also broke down submissions by descriptor size, but nearly all participants opted for 128-dimensional floating-point descriptors (float32), which is the maximum size allowed this year. May 25, 2021: We have removed this rule. You may use descriptors of any size. If you use descriptors larger than 128D, we ask that you submit custom matches instead of using the built-in matchers: you may use the benchmark to obtain them, but they need to be included in the submission; this is required to keep our compute budget under control. You are still required to submit descriptor files. If your method does not use descriptors at all, you may leave these files empty. If in doubt, please reach out to us.

Submission format

Submissions should come in the form of zip files containing keypoints and descriptors for every dataset and scene, plus a single JSON file with metadata and settings. Matches can be provided or generated by the benchmark. If provided, we require separate files for stereo and multiview (the optimal settings typically vary across tasks; even if yours do not, you must provide two files). The datasets are labeled by the benchmark as "phototourism", "pragueparks", and "googleurban". For example:

$ ls my_submission
config.json  googleurban  phototourism  pragueparks

$ ls my_submission/pragueparks
lizard  pond  tree_new

$ ls my_submission/pragueparks/lizard
descriptors.h5  keypoints.h5  matches_stereo.h5  matches_multiview.h5
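
For illustration, a minimal Python sketch for packing such a folder into a zip (using only the standard library) could look as follows; the folder name is taken from the example above, and you may need to adjust root_dir/base_dir depending on whether the archive should contain the top-level folder:

import shutil

# Create my_submission.zip next to the folder, preserving the directory layout shown above.
shutil.make_archive("my_submission", "zip", root_dir=".", base_dir="my_submission")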

Please note that we do not allow combining different methods for local feature extraction and matching in a single submission. For instance, you may not use HardNet descriptors on the PhotoTourism dataset and SuperPoint on the PragueParks dataset, or RANSAC on one dataset and SuperGlue on another, as this goes against the spirit of the competition. Different hyperparameters are of course allowed. Other scenarios will be handled on a case-by-case basis: for instance, using segmentation masks to filter out irrelevant features on one dataset, but not another where the segmentation may be less reliable, is fine, and using SuperGlue followed by DegenSAC on one dataset, but not another, is also acceptable. Please use common sense and reach out to the organizers with specific questions when in doubt. Note that submissions eligible for prizes will require a document explaining their contents (see below), and may be disqualified if they violate these rules.

Configuration file

Let's look at the configuration file, i.e., config.json. Results will be saved into <json_label>.json, and labeled as <method_name> on the website: short and descriptive names are preferred. Additional information can be placed into method_description.

{
  "metadata": {
      "publish_anonymously": false,  /* Must be public after deadline to claim prize */
      "authors": "Challenge organizers",
      "contact_email": "image-matching@googlegroups.com",
      "method_name": "Short description", /* Becomes the label on the leaderboards */
      "method_description": "For example: we use standard RootSIFT features, with cycle-consistent matching, and the ratio test. For stereo, we use DEGENSAC (Chum et al, CVPR'05) with optimal settings.",
      "link_to_website": "https://www.myproject.org",
      "link_to_pdf": "",  /* Links can be empty */
  },
  "config_common": {
      "json_label": "rootsift-degensac", /* Results file: please use a safe string */
      "keypoint": "my_keypoint", /* A label for your keypoint method */
      "descriptor": "my_descriptor", /* A label for your descriptor method */
      "num_keypoints": 2048, /* Must be 2048 or 8000 */
  },
  "config_phototourism_stereo": {
     (...)

The rest of the file contains six additional fields, one for each combination of dataset/task, e.g. "config_phototourism_stereo" or "config_googleurban_multiview". If a field is missing, that dataset/task will not be processed, and the submission will be assigned the last possible rank on it. The following is an example of the configuration file when custom matches are not supplied. In this case, matches are computed by the benchmark, including the initial matching using nearest-neighbor search in descriptor space and, for the stereo task, robust estimation with RANSAC.

  (...)
  "config_phototourism_stereo": {
      "use_custom_matches": false,
      "matcher": {
           "method": "nn",  /* See methods/feature_matching/nn.py for options */
           "distance": "l2",  /* L2 or Hamming */
           "flann": true,  /* Fast Library for Approximate Nearest Neighbours */
           "num_nn": 1,  /* Number of nearest neighbours */
           "filtering": {
               "type": "snn_ratio_pairwise",  /* Standard ratio test */
               "threshold": 0.90,  /* Ratio test threshold */
           },
           "symmetric": {
               "enabled": true,
               "reduce": "both",  /* Symmetric matching with cycle consistency */
           },
      },
      "outlier_filter": {
          "method": "none",  /* Must be "none" for challenge submissions */
      },
      "geom": {
          "method": "cmp-degensac-f",  /* DEGENSAC (Chum et al, CVPR 2005) */
          "threshold": 0.75,  /* Inlier threshold */
          "confidence": 0.999999,  /* Confidence threshold */
          "max_iter": 100000,  /* Maximum number of iterations */
          "error_type": "sampson",
          "degeneracy_check": true,
      },
  },
  "config_phototourism_multiview": {
      "use_custom_matches": false,
      "matcher": {
           "method": "nn",
           "distance": "L2",
           "flann": true,
           "num_nn": 1,
           "filtering": {
               "type": "snn_ratio_pairwise",
               "threshold": 0.95,  /* Relax or tighten the ratio test for SfM */
           },
           "symmetric": {
               "enabled": true,
               "reduce": "both",
           },
      },
      "outlier_filter": {
          "method": "none",
      },
      "colmap": {},  /* Currently unused */
  }

However, most participants opt to supply custom matches. Note that in this case you must run RANSAC yourself if you wish to do so, particularly for the stereo task. For the multiview task, COLMAP will always run its own RANSAC, but you can also pre-filter the matches with other algorithms. You must supply a list of matches separately for every dataset, scene, and task (even if they happen to be the same).

Note that for experiments on the validation set, the metadata field is optional, and a single JSON file can contain multiple entries, encoded as [{ <method_1>}, { <method_2>}]. Submissions to the challenge must contain a single entry.

Formatting your features and matches

Let's now look at the data files, which are encoded as HDF5. Consider a toy scene with four images:

$ ls toy_dataset/toy_scene
image1.jpg
image2.jpg
image3.jpg
image4.jpg

Keypoints and descriptors must be provided separately for each image, using image filenames (minus the extension) as keys. The keypoint file must contain, for each image, an N x 2 array of 0-indexed (x, y) coordinates, with the origin in the top-left corner.

>>> import h5py
>>> with h5py.File('my_submission/toy_dataset/toy_scene/keypoints.h5') as f:
>>>     for k, v in f.items():
>>>         print((k, v.shape))

('image1', (2048, 2))
('image2', (1984, 2))
('image3', (2013, 2))
('image4', (2048, 2))
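
For reference, a file with this layout can be written with h5py as in the sketch below; the coordinates are random placeholders standing in for your detector's output, and the file path is arbitrary:

import h5py
import numpy as np

images = ["image1", "image2", "image3", "image4"]
with h5py.File("keypoints.h5", "w") as f:
    for name in images:
        # N x 2 array of 0-indexed (x, y) pixel coordinates, origin at the top-left corner.
        keypoints = np.random.rand(2048, 2).astype(np.float32) * 1024
        f[name] = keypoints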

Note that the number of features for every image is capped at 2048 for the "2k" category and 8000 for the "8k" category: submissions where any image has more than 2048 keypoints will be moved to the "8k" category, and submissions where any image has more than 8000 keypoints will not be processed. Descriptors should be consistent with the list of keypoints, and must be stored as float32.

>>> import h5py
>>> with h5py.File('my_submission/toy_dataset/toy_scene/descriptors.h5') as f:
>>>     for k, v in f.items():
>>>         print((k, v.shape))

('image1', (2048, 128))
('image2', (1984, 128))
('image3', (2013, 128))
('image4', (2048, 128))
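
A matching descriptor file can be produced along the same lines; in this sketch the 128-D vectors are random placeholders, and row i of each array describes keypoint i of the same image in keypoints.h5:

import h5py
import numpy as np

with h5py.File("keypoints.h5", "r") as kp, h5py.File("descriptors.h5", "w") as f:
    for name, keypoints in kp.items():
        # One float32 descriptor per keypoint, in the same order as in keypoints.h5.
        f[name] = np.random.rand(keypoints.shape[0], 128).astype(np.float32)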

If you want to specify your own matches, you will have to provide them for every possible pair of images in each scene. The match file should contain a key for every image pair, following the convention LARGEST_KEY-SMALLEST_KEY. For instance, for this toy scene the file would contain six keys, as follows:

>>> import h5py
>>> with h5py.File('my_submission/toy_dataset/toy_scene/matches-stereo.h5') as f:
>>>     for k, v in f.items():
>>>         print((k, v.shape))

('image2-image1', (2, 102))
('image3-image2', (2, 405))
('image3-image1', (2, 145))
('image4-image1', (2, 88))
('image4-image2', (2, 245))
('image4-image3', (2, 47))

Each of these entries stores the list of matches between the image given by the first key and the image given by the second key, such that:

>>> with h5py.File('my_submission/toy_dataset/toy_scene/matches-stereo.h5') as f:
>>>     print(f['image2-image1'][()])

[[524    10    2 ... 1009  510 2011]
 [398  1087  618 ... 2002 1467  558]]

That is, keypoint 524 in image2.jpg matches keypoint 398 in image1.jpg, and so on.
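
As an illustration of this key convention, the sketch below writes a match file for the toy scene, with random indices standing in for real matches (the filename mirrors the read example above):

import itertools

import h5py
import numpy as np

images = ["image1", "image2", "image3", "image4"]
with h5py.File("matches-stereo.h5", "w") as f:
    for a, b in itertools.combinations(images, 2):
        first, second = max(a, b), min(a, b)  # LARGEST_KEY-SMALLEST_KEY, e.g. 'image2-image1'
        # Row 0: keypoint indices in `first`; row 1: the matched keypoint indices in `second`.
        f[f"{first}-{second}"] = np.random.randint(0, 2048, size=(2, 100))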

The "task" section of the configuration file for a submission with custom matches will look like this (for the "Phototourism" dataset):

  "config_phototourism_stereo": {
      "use_custom_matches": true,
      "custom_matches_name": "my_matcher",
      "geom": {
          "method": "cv2-8pt",  /* Must not be changed */
      },
  },
  "config_phototourism_multiview": {
      "use_custom_matches": true,
      "custom_matches_name": "my_matcher",
      "colmap": {},  /* Unused */
  }

The "custom_matches_name" must be the same for all tasks and datasets, even if the files are different. It is only used to identify the method. The "geom"/"method" field must be set to "cv2-8pt" if custom matches are enabled, as we assume you tune and run your favourite RANSAC algorithm, if applicable.

Note that while the benchmark takes additional input, such as keypoint scores (which can be used to subsample the list of keypoints by their score) and their scale or orientation, this information is not used in the challenge. More examples are available in the baseline repository. For up-to-date documentation, please refer to the benchmark documentation here.

Validating your submission

Please always run submission_validator.py in our benchmark repo to make sure your submission is in the proper format before uploading it to the challenge website. The validation script needs to collect the image names in the dataset folder in order to check whether your submission contains all the image keys. You need to download all the test data and place it in a folder with the following structure:

├── [Dataset 1]
│   ├── [Sequence 1]
│   │   ├── images
│   ├── [Sequence 2]
│   │   ├── ...
├── [Dataset 2]
│   ├── ...
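
As a quick supplementary sanity check (not a replacement for submission_validator.py), you can verify that a scene's keypoint file has a key for every image; the sketch below assumes the test data was unpacked into a hypothetical test_data folder with the layout shown above:

import os

import h5py

dataset, scene = "pragueparks", "lizard"  # example names from this page
image_dir = os.path.join("test_data", dataset, scene, "images")
images = {os.path.splitext(f)[0] for f in os.listdir(image_dir) if f.lower().endswith(".jpg")}

with h5py.File(os.path.join("my_submission", dataset, scene, "keypoints.h5"), "r") as f:
    missing = images - set(f.keys())

print("Missing keys:", sorted(missing) if missing else "none")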

Additional constraints

Most participants will use custom matches. In this case, we ask that they detail how the matching was performed (including RANSAC hyperparameters, if applicable) in a PDF (see next section). However, some participants may choose to use built-in benchmark methods. In that case, we cap the number of iterations as follows:

These are generous limits, close to saturated performance.

The benchmark is quite heavy in terms of computational cost, so we will enforce a limit on the number of submissions: each group may send three submissions per week. Note that hyperparameter tuning on the test set is not allowed and will be penalized.

Explaining your submission

Participants should provide a short description of their solution, detailing their method and the data used for training and/or hyperparameter selection (if applicable), and guaranteeing that it does not overlap with our test set. To this end, we ask them to upload a short, non-anonymous PDF containing a description of their method. This is only necessary for methods which score in the top 10 for each category or have not been processed yet. A template is provided at the top of this page. This PDF must be sent by email to image-matching@googlegroups.com the day after the submission deadline. The organizers reserve the right to disqualify participants from the challenge for failing to meet these requirements.

The organizers reserve the right to disqualify participants if there is reasonable cause to suspect of cheating or unfair practices.

The organizers reserve the right to request code from winning submissions to reproduce the results. If necessary, this would be done privately, respecting the license, and organizers would delete the code after such verification. Open-sourcing submissions is always welcome but not required.

Hyperparameter tuning on the test set (via multiple submissions or any other means) is not allowed.

The use of multiple "accounts" is not allowed.

Terms and conditions

The primary focus of this challenge is to provide tools to facilitate the understanding of the wide-baseline image matching problem and advance the state of the art. However, any challenge needs to take measures against unfair practices (and outright cheating). One example: it would be possible to run RANSAC for tens of millions of iterations, which would not help advance science.

Determining a winner

Each submission will be scored and ranked by the evaluation metric, as stated on the competition website, which is the accuracy of the estimated poses in terms of the mean Average Accuracy (mAA) at a 10-degree error threshold. Performance is averaged across six data points (two tasks times three datasets) and methods are sorted by average rank. The organizers reserve the right to update these and any other terms until the submission deadline.

Anonymous submissions

We allow anonymous submissions (but discourage them). The following fields will be anonymized: authors, contact_email, method_description, link_to_website, link_to_pdf. We still require them to be properly filled in, with enough information to get a basic understanding of the submission. The following fields will always be public: keypoint, descriptor, method_name. The latter is used as a label for the method and should be self-explanatory (e.g. not a random string). Please note that the organizers will release all information after the challenge deadline. Methods with incomplete descriptions (e.g. description: "Paper under review.") will not be processed. A one- or two-line description does not need to be enough to fully reproduce your method. Metadata may be edited after being processed, but only to add links (paper, code repository) or to fix errors.

Using pretrained models/training data

The use of pretrained models (e.g. ImageNet, parts of the MegaDepth dataset, and so on) is allowed, as long as they were not trained on data related to the test set: see the next section. For example, using a model pretrained on the full YFCC100M is prohibited, as it overlaps with our test set. Participants must disclose the pretrained models they used in their description and guarantee that any data they used does not overlap with our test data, in terms of images and scenes. According to colleagues, this is the list of overlapping scenes from the MegaDepth dataset: 0024 (British Museum), 0021 (Lincoln Memorial Statue), 0025 (London Bridge), 1589 (Mount Rushmore), 0019 (Sagrada Familia), 0008 (Piazza San Marco), 0032 (Florence Cathedral), 0063 (Milan Cathedral). Additionally, 0015 and 0022 contain our validation scenes (which are not banned, but may bias your models). Participants should ensure that this information is correct: the organizers take no responsibility for it.

Regarding our test set

Using the challenge test set in any way other than to produce the submission as instructed by the organizers is prohibited. Scraping the test data or obtaining it from other sources and using it for training, validation, or any other purpose is prohibited. Using images showing the scenes or landmarks present in the test set is prohibited, even if the images are different from the ones in the test set. For example, no image depicting London Tower Bridge (a.k.a. "london_bridge") can be used for training, validation, or any other purpose. This includes pictures or drawings, even if they were made by the participants themselves.

Prizes

Prizes (to be announced later) are available thanks to our sponsors and will be awarded at the conference, or sent by post after it. Due to trade sanctions, we are unable to award prizes to residents of the following countries/regions: Cuba, Iran, North Korea, Sudan, Syria, and Crimea. Challenge organizers may participate in the challenge but are not eligible for prizes.

Citation

Last but not least, if you find the benchmark/challenge useful, please cite this paper:

@article{jin2021image,
  title={Image matching across wide baselines: From paper to practice},
  author={Jin, Yuhe and Mishkin, Dmytro and Mishchuk, Anastasiia and Matas, Jiri and Fua, Pascal and Yi, Kwang Moo and Trulls, Eduard},
  journal={International Journal of Computer Vision},
  volume={129},
  number={2},
  pages={517--547},
  year={2021},
  publisher={Springer}
}