Review Article

The World Bacterial Biogeography and Biodiversity through Databases: A Case Study of NCBI Nucleotide Database and GBIF Database

Algorithm 2

Biodiversity and Biogeography—NCBI_Nucleotide_Tracker.
Definition part:
   Connection variables (undertaken by Biopython package)
   Bacteria phyla (bacteria_main_groups)
   List of geographical areas (list from file: countries_list_all.txt); see
   supplementary materials.
   The query structure (term = “country AND Geographical area’s name AND
Bacteria [Organism] AND Date of publication”)
   gi_list (list of records verifying the query structure)
   listWC (number of records in which the /country qualifier exists)
   lisV (number of records whose /country qualifier is actually attributed to
   the right geographical area)
   // all variables are initialized to zero (0) or to an empty list.
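
The definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names `build_query_term` and `new_counters`, the quoting of the area name, and the `[PDAT]` date-range syntax are assumptions; only the `[Organism]` tag and the variable names come from the definitions.

```python
def build_query_term(area, date_from, date_to):
    """Compose the search term for one geographical area.

    The [Organism] tag comes from the query structure above; the quoted
    area name and the date range with [PDAT] are assumptions of this sketch.
    """
    return (
        f'country AND "{area}" AND Bacteria[Organism] '
        f"AND {date_from}:{date_to}[PDAT]"
    )


def new_counters(bacteria_main_groups):
    """Return the per-area variables, all set to zero or to an empty list."""
    return {
        "gi_list": [],
        "listWC": 0,
        "lisV": 0,
        "phyla": {phylum: 0 for phylum in bacteria_main_groups},
    }
```

Calling `build_query_term` once per entry of countries_list_all.txt, with a fresh `new_counters(...)` result for each area, would mirror the reset described in the last line of the definition part.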
Define treatments and operations:
   For every geographical area from the list found in “countries_list_all.txt”:
      (i) Query the NCBI database, using the query structure.
      (ii) Retrieve the count of gi_list
      (iii) Retrieve all the records (Genbank format) one by one
      (iv) Access each record:
         If the /country qualifier exists then:
            listWC + 1
            If the qualifier value matches the geographical area of
            interest:
               lisV + 1
               Check for the taxonomy:
               Count the sequence regarding the appropriate phylum.
               If there is no taxonomy for the sequence (no
               bacteria), then register the GI in the
               file “geographical_area_Absence_Bact.txt”; see
               supplementary materials.
   Save the results for all records of the geographical area as a row in the
   result file (country_all.txt); see supplementary materials.
   Remove the geographical area from the list of geographical areas.
If any error occurs, save the error type in “error.txt”; see supplementary materials.
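
Steps (i) to (iv) can be sketched in Python as follows. This is a hedged illustration rather than the published implementation: `fetch_records`, `country_values` and `tally_record` are hypothetical names, the rule that the text before a colon in the qualifier value is compared with the area name is an assumption, and the fake records at the end only stand in for parsed GenBank records. `fetch_records` needs Biopython and network access, so it is defined but not called here.

```python
from types import SimpleNamespace  # only used to fake records below


def fetch_records(term, email, retmax=100):
    """Steps (i)-(iii): query, read the count, fetch GenBank records."""
    from Bio import Entrez, SeqIO  # Biopython, imported lazily

    Entrez.email = email
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    found = Entrez.read(handle)
    handle.close()
    count, ids = int(found["Count"]), found["IdList"]
    records = []
    if ids:
        handle = Entrez.efetch(db="nucleotide", id=",".join(ids),
                               rettype="gb", retmode="text")
        records = list(SeqIO.parse(handle, "genbank"))
        handle.close()
    return count, records


def country_values(record):
    """Collect the country qualifier values of the record's source features."""
    values = []
    for feature in record.features:
        if feature.type == "source":
            values.extend(feature.qualifiers.get("country", []))
    return values


def tally_record(record, area, counters, absent):
    """Step (iv): update listWC, lisV and the phylum counts for one record."""
    values = country_values(record)
    if not values:
        return
    counters["listWC"] += 1
    # Assumption: values may look like "Morocco: Rabat"; compare the part
    # before the colon with the area name, ignoring case.
    names = [value.split(":")[0].strip().lower() for value in values]
    if area.lower() not in names:
        return
    counters["lisV"] += 1
    taxonomy = record.annotations.get("taxonomy", [])
    for phylum in counters["phyla"]:
        if phylum in taxonomy:
            counters["phyla"][phylum] += 1
            return
    absent.append(record.id)  # no phylum of interest: register the record


def fake(rec_id, country, taxonomy):
    """Build a stand-in object with the attributes used above."""
    qualifiers = {"country": [country]} if country else {}
    source = SimpleNamespace(type="source", qualifiers=qualifiers)
    return SimpleNamespace(id=rec_id, features=[source],
                           annotations={"taxonomy": taxonomy})


records = [
    fake("rec1", "Morocco: Rabat", ["Bacteria", "Proteobacteria"]),
    fake("rec2", "Spain", ["Bacteria", "Firmicutes"]),
    fake("rec3", None, ["Bacteria", "Firmicutes"]),
    fake("rec4", "Morocco", ["Eukaryota"]),
]
counters = {"listWC": 0, "lisV": 0,
            "phyla": {"Proteobacteria": 0, "Firmicutes": 0}}
absent = []
for record in records:
    tally_record(record, "Morocco", counters, absent)
```

In a full run, the `absent` list would be written to the area's "Absence_Bact" file and `counters` to one row of the result file, with any exception type appended to the error file.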