Run Elastic Open Crawler in Windows with Docker

Learn how to use Docker to get Open Crawler working in a Windows environment.

Open Crawler does not have official Windows support, but that doesn’t mean it won’t run in Windows! In this blog, we will explore using Docker to get Open Crawler up and running in your Windows environment.

We are going to explore two ways of downloading and running Open Crawler on your system. Both rely on Docker, and the instructions closely follow Open Crawler’s existing documentation. However, we will point out the (very minor!) modifications you must make to commands and files to make standing up Open Crawler a smooth experience!

Prerequisites

Before getting started, make sure you have the following installed on your Windows machine:

  • git
  • Docker Desktop
  • Docker Desktop CLI (included with Docker Desktop)
  • Docker Compose (included with Docker Desktop)

You can learn more about installing Docker Desktop in Docker’s official documentation.
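A quick way to confirm everything is installed and available on your PATH is to check each tool’s version from a PowerShell terminal:

git --version
docker --version
docker-compose --version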

Furthermore, this blog assumes version 0.3.0 or newer of Open Crawler. Using the :latest tagged Docker image should result in at least version 0.3.0 as of the time of writing.

Creating a configuration YAML

Before diving into the two methods, you need to create a basic configuration file for Open Crawler to use.

Using a text editor of your choice, create a new file called crawl-config.yml with the following content and save it somewhere accessible.

output_sink: console
log_level: debug

domains:
    - url: "https://www.speedhunters.com"

max_redirects: 2
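The console output sink simply prints crawl results to your terminal, which is ideal for a first test. If you later want to index crawl results into Elasticsearch instead, the configuration would look roughly like the sketch below (the field names follow Open Crawler’s documentation, but the index name, host, port, and API key are placeholders you would replace with your own values):

# Sketch of an Elasticsearch output configuration; the index name,
# host, port, and API key below are placeholders.
output_sink: elasticsearch
output_index: my-crawl-index

domains:
    - url: "https://www.speedhunters.com"

elasticsearch:
  host: http://localhost
  port: 9200
  api_key: <your-api-key>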

Running Open Crawler directly via Docker image

Step 1: Pull the Open Crawler Docker image

First, you must download the Open Crawler Docker image onto your local machine. The docker pull command downloads the latest Docker image for you.

Run the following command in your command-line terminal:

docker pull docker.elastic.co/integrations/crawler:latest

If you are curious about all of the versions of Open Crawler that are available, or want to experience a snapshot build of Open Crawler, check out the Elastic Docker integrations page to see all of the available images.
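For example, pulling a pinned release instead of :latest looks like this (the 0.3.0 tag is illustrative; check the integrations page for the tags that actually exist):

docker pull docker.elastic.co/integrations/crawler:0.3.0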

After the pull completes, you can run the docker images command to confirm the image is now available locally:

PS C:\Users\Matt> docker images
REPOSITORY                                              TAG                IMAGE ID       CREATED        SIZE
docker.elastic.co/integrations/crawler                  latest             5d34a4f6520c   1 month ago   503MB
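If you want to double-check exactly which version of Open Crawler the image contains, the CLI can report it (this assumes the version subcommand, which should be present in 0.3.0+ images):

docker run -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler version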

Step 2: Execute a crawl

Now that a configuration YAML has been made, you can use it to execute a crawl!

From the directory where your crawl-config.yml is saved, run the following command in PowerShell:

docker run `
  -v .\crawl-config.yml:/crawl-config.yml `
  -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler crawl /crawl-config.yml

Please be mindful of the mix of Windows-style backslashes and Unix-style forward slashes in the command’s volume (-v) argument: the left-hand side of the colon is a Windows-style path (with a backslash), and the right-hand side has a forward slash. Also note that the backtick (`) at the end of each line is PowerShell’s line-continuation character, used in place of the backslash you would use in a Unix shell.

  -v .\crawl-config.yml:/crawl-config.yml

The -v argument is mapping a local file (.\crawl-config.yml) to a path inside the container (/crawl-config.yml).
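If Docker rejects the relative path (support for relative paths in -v varies by Docker version), you can mount with an absolute path instead. In PowerShell, ${PWD} expands to the current directory, so the following sketch works when crawl-config.yml sits in your current working directory:

docker run `
  -v ${PWD}\crawl-config.yml:/crawl-config.yml `
  -it docker.elastic.co/integrations/crawler:latest jruby bin/crawler crawl /crawl-config.yml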

Running Open Crawler with docker-compose

Step 1: Clone the repository

Use git to clone the Open Crawler repository into a directory of your choosing. The command below uses the SSH URL, which assumes you have SSH keys set up with GitHub; you can also clone over HTTPS using https://github.com/elastic/crawler.git:

git clone git@github.com:elastic/crawler.git

Tip: Don’t forget, you can always fork the repository as well!

Step 2: Copy your configuration file into the config folder

At the top level of the crawler repository, you will see a directory called config. Copy the configuration YAML you created, crawl-config.yml, into this directory.

Step 3: Modify the docker-compose file

At the very top level of the crawler repository, you will find a file called docker-compose.yml. You will need to ensure the local configuration directory path under volumes is Windows-compliant.

Using your favorite text editor, open docker-compose.yml and change “./config” to “.\config”:

Before

volumes:
  - ./config:/home/app/config

After

volumes:
  - .\config:/home/app/config

This volumes entry mounts your local repository’s config folder into the Docker container, so the container can see and use your configuration YAML.

The left-hand side of the colon is the local path to be mounted (hence why it must be Windows-compliant), and the right-hand side is the destination path in the container, which must be Unix-compliant.
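For reference, the relevant part of docker-compose.yml should end up looking something like the sketch below after your edit (a sketch only; the actual file in the repository contains additional settings, and the crawler container name matches the exec commands used later in this blog):

# Sketch of the relevant docker-compose.yml section after the edit;
# the real file in the repository contains additional settings.
services:
  crawler:
    image: docker.elastic.co/integrations/crawler:latest
    container_name: crawler
    volumes:
      - .\config:/home/app/config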

Step 4: Spin up the container

Run the following command to bring up an Open Crawler container:

docker-compose up -d

You can confirm the container is running either in Docker Desktop (on the Containers page) or with the following command:

docker ps -a
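You can also tail the container’s logs to confirm it started cleanly (crawler is the container name defined in the repository’s docker-compose.yml, as used in the exec commands below):

docker logs -f crawler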

Step 5: Execute a crawl command

Finally, you can execute a crawl! The following command will initiate a crawl in the running container that was just spun up:

docker exec -it crawler bin/crawler crawl config/crawl-config.yml

Here, the command uses only Unix-style forward slashes because it is calling the Open Crawler CLI that resides inside the container.

Once the command begins running, you should see the output of a successful crawl! 🎉

PS C:\Users\Matt> docker exec -it crawler bin/crawler crawl config/crawl-config.yml
[crawl:684739e769ea23aa2f4aaeb5] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[crawl:684739e769ea23aa2f4aaeb5] [primary] Starting a crawl with the following configuration: <Crawler::API::Config: log_level=debug; event_logs=false; crawl_id=684739e769ea23aa2f4aaeb5; crawl_stage=primary; domains=[{:url=>"https://www.speedhunters.com"}]; domain_allowlist=[#<Crawler::Data::Domain:0x3d
...
...
binary_content_extraction_enabled=false; binary_content_extraction_mime_types=[]; default_encoding=UTF-8; compression_enabled=true; sitemap_discovery_disabled=false; head_requests_enabled=false>
[crawl:684739e769ea23aa2f4aaeb5] [primary] Starting the primary crawl with up to 10 parallel thread(s)...
[crawl:684739e769ea23aa2f4aaeb5] [primary] Crawl task progress: ...

The console output above has been shortened for brevity, but these are the main log lines to look out for!
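When you are finished crawling, you can stop and remove the container from the same directory where you ran docker-compose up:

docker-compose down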

Conclusion

As you can see, it only takes a little mindfulness around Windows-style paths to make the Open Crawler Docker workflow compatible with Windows! As long as Windows paths use backslashes and Unix paths use forward slashes, you will be able to get Open Crawler working as well as it would in a Unix environment.

Now that you have Open Crawler running, check out the documentation in the repository to learn more about how to configure Open Crawler for your needs!

