__  
   _______  ______  ___  _____________  ____ ___________/ /_ 
  / ___/ / / / __ \/ _ \/ ___/ ___/ _ \/ __ `/ ___/ ___/ __ \
 (__  ) /_/ / /_/ /  __/ /  (__  )  __/ /_/ / /  / /__/ / / /
/____/\__,_/ .___/\___/_/  /____/\___/\__,_/_/   \___/_/ /_/ 
          /_/                                                


WARNING: there is no pagination for results. You have been warned!

Read this manual or you'll be directed back to it anyway if you ask something answered here.

You are searching through links on archive.org. But wait... there's more!
These are specific sites for which we have indexed all *available* files, with the purpose
of making it easier to find "lost" files. Well, hopefully that will happen, anyway...

The "specific" part refers to old school VR and graphics mostly, but you will find a megaload of other stuff in here, too.

This engine searches the text of the URLs in our database for specific keywords and returns the results;
it does not load the URLs themselves to search. It searches flat files and it's a little slow, but it's fine for this purpose,
and it's still quicker than doing it manually.

Everything is pulled manually with custom software we've written to gather the data in a specific way and then filter it.
The results you see are only HTTP status code 200 (files should all be there), as opposed to 404 (file not found) or other HTTP status codes.
Still, the crawlers sometimes claim to have files when they do not, though this is rare.
The data is then uploaded here to become searchable.
All indexes are pulled in their entirety with no date limits unless specified (keep reading, see below).
All data is available to you as JSON, one 'dictionary' per line. You can download the data as valid JSON and search it yourself;
simply click the [data] button on the search form to browse our data sets.
See code examples at the bottom of this document about how to query the data using Python.

The data /should/ be every *available* file on the site (meaning: we ignore 404s), but stuff happens, so let us know if there's an issue.
Also, if a file is 404, it's the fault of whichever bot scraped it. We are simply making the indexes available.
You can submit requests for sites to index so long as they meet the theme (vintage VR mostly, but try me).
If you have a specific research project of interest to us we may be able to provide you software access.
If you don't want to wait or ask, you can use wayback_machine_downloader instead.
Contact admin @ this domain or talk to me on Discord with bugs/requests.

My philosophy behind data files that are labeled "up to (date)":
when you see a file like 'example.org_(2004)' it means it was indexed from the beginning up until that date.

The end date is decided by these factors: 
  1) if the site vanished and then became something else, we don't care about the "something else".
  2) if something drastically changed, e.g. a VR company pivoted to making servers instead.
  3) if it falls outside the period of interest. Generally you won't see anything later than 2008 here. The reason for a date that late
     is that the crawler doesn't always manage to grab every file, and if it hasn't grabbed a file by that point it's probably gone.
  4) if there is a megaton of files that will spam the index and might not be relevant.
  5) there can be other reasons, but this is what works for what most people I know are looking for.

DO NOT HAMMER THE SEARCH PAGE OR IT WILL BE MADE PRIVATE
Meaning, don't keep refreshing the page or otherwise abuse it. This is a research tool.
Don't do 1-character searches ("x", for example), and preferably not 2-character ones either without good reason; it's pointless and wastes resources.
Don't search for oddly unspecific things like ".com", ".net" and so on (file extensions are fine, though); it's futile and a waste.
The resources are artificially limited on this script so you can't DoS the site, but please still use it wisely.
Don't scrape the engine, instead you should download the datasets from the [data] button on the search page.
It's free and WAY more efficient than scraping, try it!
See code examples at the bottom about how to query the data using Python.
If you disregard the rules you may be banned and this page may become passworded.

EXAMPLE SEARCHES (case sensitive):
cats       - returns any link containing 'cats'
devel5.zip - returns any link containing the string 'devel5.zip'
.exe       - returns any link containing '.exe'
/index/    - returns any link containing '/index/'
/INDEX/    - returns any link containing '/INDEX/' but not '/index/' or '/iNdEx/'
Magic      - returns links containing 'Magic' but not 'magic' or 'MAGIC'
$%_!/      - returns '$%_!/'

EXAMPLE SEARCHES (case INsensitive):
!CaTs      - returns 'CaTs', 'CATS', 'cats' or any variation thereof
!.EXE      - returns 'EXE', 'exe' and so on...
!OH!       - returns 'OH!', 'oh!' and so on...

Note that '.exe' will also return 'www.executive.gov', for example, and not just 'game.exe'.

You can also click the 'Aa' checkbox to perform case INSENSITIVE searches.
**The default search is case sensitive**; this is done on purpose, and no, I won't change it.
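
If you want to reproduce these matching rules locally against a downloaded dataset, here's a minimal Python sketch (the '!' prefix and substring behaviour follow the examples above; the 'links' list is made up for illustration, and the real data uses the 'original' field shown in the example code at the bottom):

def matches(term, url):
    # A leading '!' makes the match case-insensitive; otherwise it's a
    # plain case-sensitive substring match, as in the examples above.
    if term.startswith('!'):
        return term[1:].lower() in url.lower()
    return term in url

links = ['http://example.com/Magic/devel5.zip', 'http://example.com/GAME.EXE']
print([u for u in links if matches('Magic', u)])   # case sensitive: first link only
print([u for u in links if matches('!.exe', u)])   # case insensitive: second link only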

WHY DOES THIS EXIST?
It sucks trying to find lost files. If I'm researching stuff I may as well save the data for later so
we can all look for other desirable files instead of doing the same thing over and over again.
The idea is this could potentially save us all a lot of time and hopefully unearth lost treasures.
The command line tools I have are far more powerful than this; this is the way I felt like sharing it for now.
It's also kinda fun to use the engine to look for files quickly instead of the cli.
Remember to submit cool stuff using the 'submit' button at the top, and you can see what's been submitted
by clicking 'finds'. Many sites here were mentioned or given by other people.

NAMING CONVENTIONS
The naming convention for the data files is the domain and path (if a path is used).
Underscores in filenames are forward slashes; an underscore followed by a hyphen is forward slash + tilde (/~).
If the filename ends in _(YYYY) (example: example.com_thing_(2009)) it means it was only indexed up TO that date.
These will appear under the [what] button as 'example.com-up-to(2006)', which is different from the filename.
I don't anticipate ever needing to use a "from" date given the focus of this project.
All grabs are from the beginning to the end unless an end date is specified.
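
As a rough illustration, here's a small Python sketch that turns a dataset filename back into a site path under these conventions (the second example filename is made up; edge cases may exist):

import re

def decode_dataset_name(filename):
    # Strip an optional trailing "_(YYYY)" end-date marker.
    up_to = None
    m = re.search(r'_\((\d{4})\)$', filename)
    if m:
        up_to = m.group(1)
        filename = filename[:m.start()]
    # "_-" stands for "/~"; a plain "_" stands for "/".
    path = filename.replace('_-', '/~').replace('_', '/')
    return path, up_to

print(decode_dataset_name('example.com_thing_(2009)'))  # ('example.com/thing', '2009')
print(decode_dataset_name('example.org_-user'))         # ('example.org/~user', None)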

LIMITATIONS
Your computer may crash from lack of pagination of results. You've been warned now twice ;)
This will probably never be implemented, because I don't care :D
If you do something dumb like search ".com" your browser will probably become unresponsive and your OS may crash.
There is no spam filter on submissions; don't abuse it & beware of what you click.
We are not some massive data hoarding center (jklol, yes we are), but this server is limited and
eventually this thing will run out of space to keep new sites.
Eventually it will also run out of resources and possibly burst into flames, so enjoy it while it lasts.
Generally, there is no guarantee of completeness or reliability of data. Shit happens. Use at your own risk.


Lastly, the search box tells you the date/time the last dataset was added when you load the page initially.
It is currently updated fairly frequently. Keep checking back.
The dropdown will be used to segregate different types of searches. The "web" will always be VR related,
but the FTP and others will be random. I will not put VR stuff under any category other than "web".
You can get the FTP data here: https://superscape.org/search/ftp/data
Local file search will be coming soon so you can search our mirrors and files:
https://superscape.org/files/
https://superscape.org/files/mirrors/


EXAMPLE CODE
Python script to query the datasets:
import json
import sys

case_insensitive = True  # default; set to False for exact-case matching

filename = input("Filename: ")
search   = input("Search term: ")

try:
    with open(filename, 'r') as f:
        raw = f.read().rstrip('\n')
except OSError:
    print("File not found or other file access error")
    sys.exit(1)

# The dataset downloads as valid JSON; parse the whole file into a list of records.
records = json.loads(raw)

for record in records:
    url = record['original']
    if case_insensitive:
        if search.lower() in url.lower():
            print("found " + str(url))
    else:
        if search in url:
            print("found " + str(url))