Expand your professional network using Google Search and Python


Motivation

Not long ago I started expanding my online network with professionals relevant to my field. This was not because I was seeking a job. I have always been curious to know what kind of business problems people are trying to solve using data, what the data architecture of a known company looks like, or even what the work-life balance of a specific business role is like. In the same way that people share entertaining content from their personal life on a social media platform, they also share valuable content from their professional life on a professional social media platform. Interacting with professional content helps me get to know my field better, stay up to date with the latest industry trends, and even discover potential mentors.

The platform I am using is LinkedIn, but I found it too limited for discovering professionals relevant to my field without paid services. So I put together a method to discover professionals of a specific seniority and field, and to extract their details. As an example, in this guide we will explore how to find all senior data engineers in Greece (or, more precisely, all who identify themselves as senior data engineers) and collect their details in a CSV file.

Google Search like a pro

Public LinkedIn profiles, like most public web pages, are indexed by Google. As a result, LinkedIn profiles are searchable via Google. The profiles that we are interested in are all Senior Data Engineer profiles in Greece.

  1. We will start by searching on Google:
    **linkedin senior data engineer greece**

1_google_search

There are a lot of problems with this search.

  • We got back 27 million results. We can say for sure that our search is inaccurate, since the whole population of Greece is about 10 million people.
  • If we inspect the results, we mainly get job ads, social groups and books. Also, the results are not exclusively from LinkedIn but also from other sites.
  • Google will never give us access to 27 million results with a single search. Search engines return a limited number of results. Whenever the total exceeds what the search engine returns, the search should be split up into smaller searches. Google typically lets you see up to about 400 results.
  • In this case, we could see only 61 of the 27 million results, the ones Google ranked as the best match for our search.

2_google_search_2

Most of these problems occur because the above query in reality translates to "Find a page with linkedin AND senior AND data AND engineer AND greece somewhere in it". So next we need to be more specific by using search operators (check out the official documentation).

  2. After refining our search query a bit, we have: **site:gr.linkedin.com/in/ intitle:"senior data engineer"**

3_google_search_3

What we improved here:

  • site operator: Limits the results to a particular site or pattern. In our case, we limited the results to LinkedIn profiles from the LinkedIn Greece website (pattern: gr.linkedin.com/in/).
  • intitle operator: Finds pages with particular words or a phrase in the title. We asked Google to return results with the phrase senior data engineer in the title.
  • double quotes "": Search for the exact match. We asked Google to return results with exactly senior data engineer in the title.

Again, we can observe a problem:

  • We expected to find more senior data engineers in Greece.

  3. Finally, we can further improve our query by searching for: **site:gr.linkedin.com/in/ intitle:"data engineer" AND (lead OR senior OR head)**

4_google_search_4

What we improved here:

  • AND/OR operators: Combine several conditions into a single search. Since job titles vary, we included lead and head as alternatives to the word senior.

Now it seems that we achieved our goal since the results are closer to what we expected. The next step is to gather the results into a single CSV file.
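Before moving on, here is a minimal sketch of how the final query could be assembled in Python from its parts. The function names `build_query` and `split_queries` are hypothetical helpers for illustration; they are not part of any library. The second helper shows one possible way to split a search into smaller ones (one query per seniority term) when a single search runs into Google's result limit.

```python
def build_query(site, title_phrase, seniority_terms):
    """Compose a query from the site, intitle and AND/OR operators."""
    query = f'site:{site} intitle:"{title_phrase}"'
    if seniority_terms:
        # Group the alternatives in parentheses so OR applies only to them.
        query += " AND (" + " OR ".join(seniority_terms) + ")"
    return query


def split_queries(site, title_phrase, seniority_terms):
    """Return one narrower query per seniority term."""
    return [
        f'site:{site} intitle:"{title_phrase}" AND {term}'
        for term in seniority_terms
    ]


terms = ["lead", "senior", "head"]
query = build_query("gr.linkedin.com/in/", "data engineer", terms)
print(query)

for narrow_query in split_queries("gr.linkedin.com/in/", "data engineer", terms):
    print(narrow_query)
```

Running the narrower queries separately returns overlapping profiles (someone can be both "lead" and "senior"), so the combined results would need to be de-duplicated by profile URL.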

Extract results using Python

To retrieve the results, we need to build an automation in Python that gathers everything into a CSV file. To achieve that we will use Playwright, an open-source framework for browser automation released by Microsoft in 2020.

You can get the full code here.

First, we need to install the required packages including Playwright and its dependencies.

pip install pandas
pip install playwright
python -m playwright install

Then, to retrieve the results of the query we defined earlier (site:gr.linkedin.com/in/ intitle:"data engineer" AND (lead OR senior OR head)), we need to run the following script.

python google_retriever.py

A Google Chrome (Chromium) instance will open and the results from each page will incrementally be gathered into a file named profiles.csv. The fields that we will keep are the following.

  1. title – result title
  2. url – LinkedIn profile
  3. spec – result subtitle
  4. name – the first part of the title
  5. location – 1st part of the subtitle
  6. position – 2nd part of the subtitle
  7. company – 3rd part of the subtitle

Note: not all fields will be available for every result.
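To make the field list more concrete, here is a rough sketch of how a single result could be split into those fields. It is not the code from the script. It assumes that the title parts are separated by " - " and the subtitle parts by " · ", which may not match what Google actually renders, and the sample person and company are made up. The function name `parse_result` is hypothetical.

```python
def parse_result(title, url, spec):
    """Split one search result into the seven fields kept in the CSV."""
    # Assumption: the name is everything before the first " - " in the title.
    name = title.split(" - ")[0].strip() if title else None

    # Assumption: the subtitle looks like "location · position · company".
    parts = [part.strip() or None for part in spec.split(" · ")] if spec else []
    parts += [None] * (3 - len(parts))  # pad missing fields with None
    location, position, company = parts[:3]

    return {
        "title": title,
        "url": url,
        "spec": spec,
        "name": name,
        "location": location,
        "position": position,
        "company": company,
    }


# Made-up example data, for illustration only.
row = parse_result(
    "Jane Doe - Senior Data Engineer - Example Corp | LinkedIn",
    "https://gr.linkedin.com/in/jane-doe-example",
    "Greece · Senior Data Engineer · Example Corp",
)
print(row["name"], "|", row["location"], "|", row["position"], "|", row["company"])
```

Padding with `None` keeps the row shape stable when a result has a shorter subtitle, so missing fields end up as empty cells instead of raising an error.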

5_google_search_results

After all results have been retrieved, the Google Chrome (Chromium) instance will close and we will get the following output.

6_python_results

By now, every profile that Google returned for our query should be in profiles.csv.
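As an illustration of the "incremental" part, the following sketch appends each page of results to a CSV file and then reads the file back, keeping one row per profile URL. It uses only the standard library and made-up rows. The helper names `append_rows` and `load_unique` are hypothetical, and the demo writes to a temporary folder rather than to the real profiles.csv.

```python
import csv
import os
import tempfile

FIELDS = ["title", "url", "spec", "name", "location", "position", "company"]


def append_rows(path, rows):
    """Append rows to the CSV, writing the header only if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        # restval fills any field that a row does not provide.
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def load_unique(path):
    """Read the CSV back, keeping the first row seen for each profile URL."""
    seen, unique = set(), []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["url"] not in seen:
                seen.add(row["url"])
                unique.append(row)
    return unique


# Made-up rows standing in for two pages of results.
jane = {"title": "Jane Doe - Senior Data Engineer",
        "url": "https://gr.linkedin.com/in/jane-doe-example", "name": "Jane Doe"}
john = {"title": "John Roe - Lead Data Engineer",
        "url": "https://gr.linkedin.com/in/john-roe-example", "name": "John Roe"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profiles.csv")
    append_rows(path, [jane])         # first page
    append_rows(path, [jane, john])   # second page repeats one profile
    profiles = load_unique(path)

print(len(profiles))
```

Appending page by page means that results collected so far survive if the run is interrupted, for example by a captcha.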

Note: there will be times when Google blocks us. When that happens, we have to solve the captcha manually or use a third-party service.
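For readers who want a feel for what such a retrieval loop can look like, here is a minimal sketch using Playwright's synchronous API. It is not the google_retriever.py script. It rests on several assumptions: that result pages can be requested with Google's `start` offset, that a redirect to a `/sorry` path means a captcha is being shown, and that each result title is rendered as an `<h3>` element. Google's markup changes often, so the selector in particular should be treated as a placeholder. The function names are hypothetical.

```python
from urllib.parse import urlencode, urlparse


def page_urls(query, pages, per_page=10):
    """Build one search URL per results page using the `start` offset."""
    return [
        "https://www.google.com/search?"
        + urlencode({"q": query, "start": i * per_page})
        for i in range(pages)
    ]


def is_blocked(url):
    """Assumption: a redirect to a /sorry path means a captcha page."""
    return urlparse(url).path.startswith("/sorry")


def collect_titles(query, pages=3):
    """Open each results page in Chromium and return the titles found."""
    # Imported here so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    titles = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # visible browser window
        page = browser.new_page()
        for url in page_urls(query, pages):
            page.goto(url)
            if is_blocked(page.url):
                # Pause so a person can solve the captcha in the open window.
                input("Captcha detected. Solve it, then press Enter...")
            # Assumption: each result title is an <h3> element.
            titles.extend(page.locator("h3").all_inner_texts())
        browser.close()
    return titles


# Only the URL helper is run here; collect_titles opens a real browser.
for url in page_urls('intitle:"data engineer"', pages=2):
    print(url)
```

Calling `collect_titles(query)` with the query from the previous section would open the browser and walk through the first three pages. Pausing for manual input is the simplest way to handle a block; a third-party captcha service would replace the `input` call.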

Conclusion

In this guide, we showed a way to collect the details of specific professional profiles so we can expand our network. Specifically, we explored how to improve the accuracy of our query in Google Search using operators and how to use a web automation framework in Python to collect the query results.

