OSINT for Penetration Testers: A Practical Introduction to Passive Reconnaissance

Legal & ethical disclaimer. This article is for education and authorized security testing only. Open-source intelligence (OSINT) gathering is often passive, but querying third-party services, scraping, and enumerating targets can still violate terms of service, computer-misuse laws (e.g. the U.S. CFAA, the UK Computer Misuse Act), or your engagement scope. Only collect intelligence about assets you own or are explicitly contracted in writing to assess. Respect data-protection regulations (GDPR, etc.) when handling personal data.

Table of contents

Introduction / Overview
How it works / Background
Prerequisites / Lab setup
Attack walkthrough / PoC
Mermaid diagram
Detection & Defense (Blue Team)
Conclusion
References

Introduction / Overview

Open-source intelligence (OSINT) is the practice of collecting and correlating publicly available information to build a picture of a target before you ever send a packet at their production systems. In the kill chain it maps to Reconnaissance — MITRE ATT&CK tactic TA0043 — and it is the phase that most cheaply determines whether the rest of an engagement succeeds.

In this article you will learn how to run a structured passive recon workflow using five staple tools: theHarvester, recon-ng, Google dorks, Shodan, and Maltego. We will collect subdomains, e-mail addresses, exposed services, and credential leaks, then pivot the findings into a graph. Finally — with equal weight — we will cover how a blue team detects and shrinks this attack surface.

How it works / Background

OSINT divides cleanly into passive and semi-passive collection:

Passive: you never touch the target's infrastructure. You query search engines, certificate-transparency logs (crt.sh), DNS aggregators, code repositories, breach databases, and registries (WHOIS/RDAP). The target sees nothing.
Semi-passive: you touch the target indirectly — for example resolving a subdomain or pulling an HTTP banner — which generates traffic but looks like normal user activity.

The core data sources are remarkably consistent across tools:

Source	What it yields
Certificate Transparency (crt.sh, Censys)	Subdomains, internal hostnames
Search engines (`site:`, `filetype:`)	Indexed files, login portals, errors
Shodan / Censys	Open ports, banners, products, CVEs
WHOIS / RDAP	Org, registrant, name servers
Breach corpora (HIBP)	Leaked e-mail/password pairs

theHarvester and recon-ng are essentially collectors and correlators that wrap dozens of these APIs behind one interface.
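To make the certificate-transparency source concrete, here is a minimal sketch of what such a collector does under the hood. It assumes crt.sh's JSON output (the `output=json` query parameter and a `name_value` field holding newline-separated names), which reflects the service's commonly observed behaviour rather than a documented, stable API. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request


def fetch_ct_records(domain: str, timeout: float = 30.0) -> list[dict]:
    """Query crt.sh for certificates issued under a domain (network call).

    Assumption: crt.sh accepts q=%.<domain>&output=json and returns a JSON array.
    """
    query = urllib.parse.urlencode({"q": f"%.{domain}", "output": "json"})
    with urllib.request.urlopen(f"https://crt.sh/?{query}", timeout=timeout) as resp:
        return json.load(resp)


def extract_subdomains(records: list[dict], domain: str) -> list[str]:
    """Flatten CT records into a sorted, de-duplicated list of in-scope names."""
    domain = domain.lower()
    suffix = "." + domain
    names = set()
    for record in records:
        # Assumption: name_value may carry several SAN entries, one per line.
        for name in record.get("name_value", "").splitlines():
            name = name.strip().lower().removeprefix("*.")
            # Keep only names that really belong to the target domain.
            if name == domain or name.endswith(suffix):
                names.add(name)
    return sorted(names)
```

Only `fetch_ct_records` touches the network, and it talks to crt.sh rather than the target, so the lookup stays passive. Feeding its result into `extract_subdomains(records, "example.com")` yields the seed list.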

Prerequisites / Lab setup

Use Kali Linux (or any Debian-based distro). Most of these ship pre-installed on Kali; otherwise:

# theHarvester
sudo apt install theharvester        # or install from source; see the project README

# recon-ng
sudo apt install recon-ng            # or install from source; see the project README

# Shodan CLI
pipx install shodan
shodan init <YOUR_API_KEY>           # key from account.shodan.io

Register free API keys to dramatically improve results: Shodan, Hunter.io, VirusTotal, GitHub (a fine-grained read-only PAT), and SecurityTrails. In recon-ng these go into the keystore:

recon-ng
[recon-ng][default] > keys add shodan_api <KEY>
[recon-ng][default] > keys add hunter_io <KEY>
[recon-ng][default] > keys list

Throughout this guide we use example.com, a domain reserved by IANA for documentation, as a placeholder. Substitute a domain from your authorized scope.

Attack walkthrough / PoC

1. theHarvester — fast surface sweep

theHarvester aggregates subdomains and e-mails from many engines in one shot. The -b flag selects backends (data sources):

# Enumerate subdomains and e-mails using crt.sh, DNS dumpster, and Bing
theHarvester -d example.com -l 500 -b crtsh,bing,dnsdumpster -f harvest_example

# Use all available sources
theHarvester -d example.com -b all

-l limits the number of search results, and -f writes the results to a report file (XML/JSON in recent releases, HTML in older ones). The output gives you a seed list of hosts and addresses to feed into the next stages.
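Turning the report into a clean seed list can be sketched as follows. This assumes a JSON report with top-level `hosts` and `emails` arrays, where a host entry may carry a trailing `:ip`. That layout varies between theHarvester versions, so check your own report before relying on it. The `load_seeds` helper and the `harvest_example.json` file name are hypothetical.

```python
import json
from pathlib import Path


def load_seeds(report: dict, domain: str) -> dict[str, list[str]]:
    """Normalize a theHarvester-style report into in-scope hosts and e-mails."""
    domain = domain.lower()
    suffix = "." + domain
    hosts = set()
    for entry in report.get("hosts", []):
        # Assumption: entries look like "host" or "host:ip"; keep the host part.
        host = entry.split(":", 1)[0].strip().lower()
        if host == domain or host.endswith(suffix):
            hosts.add(host)
    # Lowercase and de-duplicate addresses, dropping anything without an "@".
    emails = {e.strip().lower() for e in report.get("emails", []) if "@" in e}
    return {"hosts": sorted(hosts), "emails": sorted(emails)}


if __name__ == "__main__":
    path = Path("harvest_example.json")  # hypothetical report file name
    if path.exists():
        print(load_seeds(json.loads(path.read_text()), "example.com"))
```

Filtering on the domain suffix keeps out-of-scope hosts, which search-engine backends often return, out of later stages.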

2. recon-ng — modular, repeatable enumeration

recon-ng is a Metasploit-style framework: workspaces, a database, and modules. A typical subdomain-to-host pivot:

recon-ng
[recon-ng][default] > workspaces create example
[recon-ng][example]  > db insert domains
domain (TEXT) > example.com

# Pull subdomains from certificate transparency
[recon-ng][example]  > marketplace install recon/domains-hosts/certificate_transparency
[recon-ng][example]  > modules load recon/domains-hosts/certificate_transparency
[recon-ng][example]  > run

# Resolve discovered hosts to IPs (modules must be installed before loading)
[recon-ng][example]  > marketplace install recon/hosts-hosts/resolve
[recon-ng][example]  > modules load recon/hosts-hosts/resolve
[recon-ng][example]  > run

# Show what we gathered, then export
[recon-ng][example]  > show hosts
[recon-ng][example]  > marketplace install reporting/csv
[recon-ng][example]  > modules load reporting/csv
[recon-ng][example]  > run

Because everything lands in a SQLite database, recon-ng excels at chaining: hosts feed resolvers, resolvers feed port-scan reporting, and contacts feed breach-lookup modules.
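Because the workspace is plain SQLite, you can also read it from your own scripts. The sketch below assumes the recon-ng 5.x layout: a `data.db` file under `~/.recon-ng/workspaces/<name>/` and a `hosts` table with `host` and `ip_address` columns. Verify both against your installation. The helper name is illustrative.

```python
import sqlite3
from pathlib import Path

# Assumed default location of the "example" workspace; adjust if yours differs.
DEFAULT_DB = Path.home() / ".recon-ng" / "workspaces" / "example" / "data.db"


def resolved_hosts(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return distinct (host, ip) pairs for hosts that have been resolved."""
    rows = conn.execute(
        "SELECT DISTINCT host, ip_address FROM hosts "
        "WHERE ip_address IS NOT NULL AND ip_address != '' "
        "ORDER BY host, ip_address"
    )
    return [(host, ip) for host, ip in rows]


if __name__ == "__main__" and DEFAULT_DB.exists():
    conn = sqlite3.connect(DEFAULT_DB)
    try:
        for host, ip in resolved_hosts(conn):
            print(f"{host}\t{ip}")
    finally:
        conn.close()
```

Reading the database directly is useful when the next stage of your workflow is a script rather than another recon-ng module.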

3. Google dorks — surgical search-engine queries

Dorking uses advanced operators to surface content the target probably did not intend to expose:

site:example.com -www                          # subdomains indexed by Google
site:example.com filetype:pdf                  # exposed documents
site:example.com inurl:admin | inurl:login     # login portals
site:example.com intitle:"index of"            # open directory listings
site:example.com ext:sql | ext:env | ext:log  # leaked configs and dumps
"example.com" site:pastebin.com                # paste leaks
site:github.com "example.com" password         # credentials in public repos

The Google Hacking Database (GHDB) at exploit-db.com curates thousands of vetted dorks. Pair these with GitHub code search (org:ExampleCorp AKIA for AWS keys, filename:.env).
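To keep dorks consistent across engagements, you can template them per target. The following is a small sketch that only builds query strings and search URLs to open manually in a browser. The template set mirrors the operators listed above, and the names `DORK_TEMPLATES`, `build_dorks`, and `search_url` are made up for illustration.

```python
from urllib.parse import quote_plus

# Templates taken from the dork list in this section; {domain} is filled per target.
DORK_TEMPLATES = {
    "documents": "site:{domain} filetype:pdf",
    "login_portals": "site:{domain} inurl:admin | inurl:login",
    "directory_listings": 'site:{domain} intitle:"index of"',
    "leaked_configs": "site:{domain} ext:sql | ext:env | ext:log",
    "paste_leaks": '"{domain}" site:pastebin.com',
}


def build_dorks(domain: str) -> dict[str, str]:
    """Fill every template with the target domain."""
    return {label: tpl.format(domain=domain) for label, tpl in DORK_TEMPLATES.items()}


def search_url(dork: str) -> str:
    """Build a URL-encoded Google search link for one dork."""
    return "https://www.google.com/search?q=" + quote_plus(dork)


if __name__ == "__main__":
    for label, dork in build_dorks("example.com").items():
        print(f"{label:20} {search_url(dork)}")
```

Generating links rather than scraping results keeps the activity within normal search-engine use and gives you a repeatable checklist per domain.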

4. Shodan — the search engine for devices

Shodan indexes service banners across the IPv4 space, so you can find exposed assets without scanning them yourself:

# Everything Shodan knows about a single IP
shodan host 93.184.216.34

# Search filters: exposed RDP on a netblock
shodan search 'net:203.0.113.0/24 port:3389'

# Find an org's Internet-facing login pages
shodan search 'org:"Example Corp" http.title:"login"'

# Hosts Shodan associates with a named CVE (the vuln: filter requires a higher-tier plan)
shodan search 'vuln:CVE-2021-44228 org:"Example Corp"'

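Shodan tags banners with CVEs it associates with the detected product and version (e.g. CVE-2021-44228 / Log4Shell), which gives you a vulnerability shortlist straight from recon. Most of these matches are inferred from banner data rather than confirmed, so treat them as leads to verify. Censys offers a comparable certificate- and host-centric dataset.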
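The same triage can be scripted with the official `shodan` Python library. This sketch assumes each search match is a dict with `ip_str`, `port`, and an optional `vulns` mapping keyed by CVE ID. Whether `vulns` is populated depends on your plan and on the service, so treat the field layout as an assumption to confirm. The function names are illustrative.

```python
def vulnerability_shortlist(matches: list[dict]) -> list[tuple[str, int, list[str]]]:
    """Reduce Shodan matches to (ip, port, CVE IDs) for banners tagged with CVEs."""
    shortlist = []
    for banner in matches:
        # Assumption: "vulns" is a mapping keyed by CVE ID when present.
        cves = sorted(banner.get("vulns") or {})
        if cves:
            shortlist.append((banner["ip_str"], banner["port"], cves))
    return sorted(shortlist)


def search_org(api_key: str, org: str) -> list[dict]:
    """Fetch matches for an organisation (network call, consumes API credits)."""
    import shodan  # third-party library, installed alongside the Shodan CLI

    api = shodan.Shodan(api_key)
    return api.search(f'org:"{org}"')["matches"]
```

In use, `vulnerability_shortlist(search_org(key, "Example Corp"))` yields the hosts worth verifying first. Splitting the pure function from the API call makes the triage logic easy to run against saved results.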

5. Maltego — visual link analysis

Maltego turns flat lists into a graph. You drop an Entity (Domain example.com) onto the canvas, then run Transforms — small queries that expand one entity into related ones: domain to MX records, domain to subdomains (via crt.sh/PassiveTotal), e-mail to breach via Have I Been Pwned, person to social profiles. The result is a relationship map that exposes pivots a text list hides — for instance a shared name server linking two "unrelated" target companies.
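The pivot idea itself is simple enough to sketch without Maltego. The snippet below is not Maltego's API. It is a plain-Python illustration, with hypothetical names and sample data, of how inverting entity relationships exposes shared infrastructure such as a common name server.

```python
from collections import defaultdict


def shared_pivots(edges: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Return related entities (e.g. name servers) linked to more than one source.

    Each edge is a (source entity, related entity) pair, for example
    (domain, name server) or (e-mail address, breach name).
    """
    linked = defaultdict(set)
    for source, related in edges:
        # Invert the relationship: index sources by the entity they point at.
        linked[related].add(source)
    return {
        related: sorted(sources)
        for related, sources in linked.items()
        if len(sources) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample data: two domains that share one name server.
    sample = [
        ("example.com", "ns1.sharedhost.net"),
        ("example.org", "ns1.sharedhost.net"),
        ("example.com", "mx.example.com"),
    ]
    print(shared_pivots(sample))
```

A graph tool adds visualization and chained transforms on top, but the underlying question is the same: which nodes are reachable from more than one starting entity.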

Mermaid diagram

[Diagram 1: OSINT passive reconnaissance workflow]

The flow: collect names and emails, store them in recon-ng, resolve to IPs, enrich with Shodan service/CVE data, then correlate everything in Maltego to produce a prioritized attack surface.

For deeper post-recon steps, see Subdomain Enumeration & Takeover and, once you have valid users, Password Spraying Against Microsoft 365.

Detection & Defense (Blue Team)

Passive OSINT against third-party services is largely invisible to the target, so defense is about reducing what is collectable and monitoring the indirect signals that remain.

Attack-surface reduction

Audit Certificate Transparency yourself. Subscribe to CT-log monitoring (crt.sh feeds, Cert Spotter, Facebook CT monitor) so you discover new/forgotten subdomains the same way attackers do. Decommission stale hosts.
Scan as the adversary does. Run periodic Shodan/Censys queries on your own ASN and IP ranges (org: / net: filters) and use Shodan Monitor alerts to flag newly exposed ports. Put management interfaces (RDP 3389, SSH 22, databases) behind a VPN or zero-trust proxy.
Hunt your own leaks. Continuously scan GitHub and pastes for your domains and secret patterns using gitleaks, trufflehog, or GitHub secret scanning + push protection. Rotate any key that appears (e.g. an AKIA... AWS key) immediately.
Minimize WHOIS/DNS exposure. Enable WHOIS privacy, and avoid descriptive internal hostnames (vpn-prod-finance.example.com) in public DNS or certificates.
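Two of the checks above lend themselves to a simple script: comparing CT-discovered names against your asset inventory, and flagging hostnames that describe their purpose. This is a minimal sketch. The keyword list and function names are assumptions to tune for your own naming scheme.

```python
import re

# Hypothetical keywords that reveal a host's role when used in public names.
DESCRIPTIVE_TOKENS = frozenset({"vpn", "prod", "finance", "admin", "internal", "staging"})


def normalize(host: str) -> str:
    """Lowercase a hostname and drop surrounding whitespace and a trailing dot."""
    return host.strip().lower().rstrip(".")


def unknown_hosts(discovered: list[str], inventory: list[str]) -> list[str]:
    """Names seen in public sources that are missing from the asset inventory."""
    known = {normalize(h) for h in inventory}
    return sorted({normalize(h) for h in discovered} - known)


def descriptive_hosts(hosts: list[str], keywords=DESCRIPTIVE_TOKENS) -> list[str]:
    """Names whose first label contains a role-revealing keyword."""
    flagged = set()
    for host in hosts:
        name = normalize(host)
        first_label = name.split(".", 1)[0]
        # Split labels such as "vpn-prod-finance" into individual tokens.
        if any(token in keywords for token in re.split(r"[-_]", first_label)):
            flagged.add(name)
    return sorted(flagged)
```

Anything returned by `unknown_hosts` is either shadow IT or a stale record, and both are worth resolving before an attacker finds them.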

Monitoring the semi-passive edge

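Detect follow-up probing and scraping in web logs: bursts of 404s on /admin, /.git/, /.env, or /backup.sql, or User-Agent strings from known tools. Dorking itself happens on the search engine, so these requests are the first signal you see. Return generic 404s and disable directory listings (Options -Indexes in Apache, autoindex off in nginx).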
Block sensitive indexing with X-Robots-Tag: noindex and proper auth — never rely on robots.txt, which is itself a recon map.
Credential-leak response. Feed Have I Been Pwned Domain Search into your IAM workflow; force resets and enforce phishing-resistant MFA so leaked passwords from theHarvester/recon-ng results are dead on arrival.
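The log-based detection described above can be prototyped in a few lines before it goes into a SIEM rule. This sketch assumes Apache/nginx combined log format and an arbitrary threshold of three hits. The path list comes from this section, and the names are illustrative.

```python
import re
from collections import Counter

# Paths from this section that legitimate visitors rarely request.
SENSITIVE_PREFIXES = ("/admin", "/.git/", "/.env", "/backup.sql")

# Assumption: combined log format, e.g.
# 198.51.100.23 - - [10/Oct/2024:13:55:36 +0000] "GET /.env HTTP/1.1" 404 153 "-" "curl/8.0"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def probe_counts(lines: list[str], threshold: int = 3) -> dict[str, int]:
    """Count 404s on sensitive paths per client IP and keep IPs at or above threshold."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if (
            match
            and match["status"] == "404"
            and match["path"].startswith(SENSITIVE_PREFIXES)
        ):
            hits[match["ip"]] += 1
    return {ip: count for ip, count in hits.items() if count >= threshold}
```

A production rule would add a time window and allow-list your own scanners, but counting per source IP is enough to surface the obvious sweeps.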

Relevant ATT&CK coverage: the applicable mitigation is M1056 (Pre-compromise). The adversary techniques in play fall under Reconnaissance (TA0043): T1593 (Search Open Websites/Domains), T1596 (Search Open Technical Databases: Shodan/Censys, WHOIS, CT), and T1589 (Gather Victim Identity Information).

Conclusion

OSINT is the highest-ROI phase of an engagement: theHarvester and recon-ng give breadth, Google dorks and Shodan give depth, and Maltego turns the noise into pivots. The same techniques are available to defenders — the team that scans its own CT logs, IP ranges, and code repositories first removes the very findings an attacker would have weaponized. Treat external attack-surface management as a continuous control, not a one-off pentest deliverable. For the next phase, continue to Active Recon & Network Scanning with Nmap.

References

MITRE ATT&CK — Reconnaissance (TA0043): https://attack.mitre.org/tactics/TA0043/
MITRE ATT&CK — T1596 Search Open Technical Databases: https://attack.mitre.org/techniques/T1596/
theHarvester (laramies): https://github.com/laramies/theHarvester
recon-ng (lanmaster53): https://github.com/lanmaster53/recon-ng
Shodan documentation: https://help.shodan.io/
Google Hacking Database (GHDB): https://www.exploit-db.com/google-hacking-database
Maltego documentation: https://docs.maltego.com/
HackTricks — External Recon Methodology: https://book.hacktricks.xyz/generic-methodologies-and-resources/external-recon-methodology
Have I Been Pwned: https://haveibeenpwned.com/
