Masterclass on Web Scraping and Text Mining

Dates:
  • Mon 14 May 2018 09.00 - 13.00
  • Tue 15 May 2018 09.00 - 13.00
  • Wed 16 May 2018 09.00 - 13.00
  • Thu 17 May 2018 09.00 - 13.00
  • Fri 18 May 2018 09.00 - 13.00

Outline:
This course introduces students to the data science fundamentals of extracting, processing and classifying web content. It reviews current methods for automated web scraping, natural language processing for parsing unstructured data, and machine learning algorithms for textual data. The first part of the course provides an in-depth survey of the structures and features of web content (XML, JSON, HTML, CSS selectors and XPath) and covers the main tools for harvesting data from static and dynamic web pages and APIs and processing it into structured formats. The second part explores applications of machine learning algorithms to the parsed data, with a particular focus on text analysis. Under the umbrella of supervised and unsupervised learning, the course covers traditional approaches to content analysis and dictionary-based methods, machine learning algorithms for classification, scaling methods and topic modeling. Our goal is to help students automate the extraction of online content, parse the unstructured data into formats amenable to analysis, and produce quantities of interest using classification and data reduction methods, working mostly with text as data. The course will be taught in R, but we may also touch upon Python libraries for particular applications.

Location:
Seminar Room 3, Badia Fiesolana

Affiliation:
Department of Political and Social Sciences

Type:
Workshop

Organiser:
Professor Elias Dinas (EUI - Department of Political and Social Sciences)

Contact:
Jennifer Rose Dari (EUI - Department of Political and Social Sciences)

Speaker:
Paulo Serôdio (Oxford University)
 
 

Page last updated on 18 August 2017