EECE 5642 Data Visualization
Midterm Project
Instructor: XXXXX
Submission Due Date: 11:59 pm, Oct. 20
Progress Review Date: Oct. 8
Submission: Blackboard
Questions
All the details and background for our midterm project can be found in the “Lecture- Midterm Project” slides. The requirements for this project are listed below.
- Preprocess the 20 Newsgroups dataset as a corpus and visualize its statistical information. (10’)
- Build two different vocabularies using different preprocessing methods; learn a Bag-of-Words (BoW) model and a TF-IDF model with each vocabulary. (10’)
- Train two LDA models on the vocabularies from Step 2; visualize the topics with four different methods; and finally obtain the topic distribution (as a feature) for each document. (20’)
- Train two Doc2Vec models on the vocabularies from Step 2; visualize your learned word and document embedding spaces; collect the Doc2Vec representation of each document. (20’)
- Conduct document clustering by K-means with four different document representations: 1) BoW; 2) TF-IDF; 3) topic distribution; and 4) Doc2Vec. Compare the different results by Normalized Mutual Information (NMI) and visualize the clustering results. (20’)
- Conduct an experimental analysis from the following aspects: 1) impact of different preprocessing methods (e.g., how to filter the vocabulary; using an n-gram model); 2) impact of different topic numbers; 3) different training methods for Doc2Vec; and 4) what is the key factor for document visualization? (20’)
- Learn document representations beyond the above ones. For example, how to use temporal context in a document? (Bonus)
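To make the BoW, TF-IDF, and NMI requirements above concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions, not the required implementation: the function names (`bag_of_words`, `tf_idf`, `nmi`) are hypothetical, tokenization is plain lowercased whitespace splitting, the idf is the unsmoothed `log(N / df)`, and NMI is normalized by the arithmetic mean of the two entropies. Library implementations (for example in scikit-learn or gensim) may use different smoothing and normalization choices, so their numbers can differ from this sketch.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Build a sorted vocabulary and one raw count vector per document.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        vectors.append(row)
    return vocab, vectors


def tf_idf(vectors):
    """Weight raw counts by idf = log(N / df) (unsmoothed, unnormalized).

    Assumption: every column has df >= 1, which holds when the vocabulary
    was built from the same documents (as in bag_of_words above).
    """
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    df = [sum(1 for row in vectors if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
        for row in vectors
    ]


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(labels_true, labels_pred):
    """Normalized Mutual Information between two label assignments.

    Assumption: normalization by the arithmetic mean of the two entropies;
    if both entropies are zero (one cluster each), 1.0 is returned.
    """
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum(
        (c / n) * math.log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy usage: two tiny "documents" and two candidate clusterings.
vocab, bow = bag_of_words(["the cat sat", "the dog sat down"])
weights = tf_idf(bow)
print(vocab)
print(bow)
print(weights)
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, labels permuted
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # partition independent of the truth
```

NMI depends only on how the documents are grouped, not on the label values themselves, which is why it is suitable for comparing K-means cluster IDs against the newsgroup categories.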
Every group is required to give a progress review presentation with slides in our class. The presentation time is about 5-10 minutes. Each talk should include the following contents.
- Introduction to your group members and team
- A clear illustration for your project
- All the experimental results you have obtained by the presentation
- A live demo of your visualization results (Jupyter Notebook is recommended).
The final submission must include the following.
- A two-page PDF report including 1) a brief introduction to the project and your method; 2) all the necessary results and analyses; 3) references for the tools and papers you used in this work (the references may be put on an extra page, which contains nothing but references).
- A package file including all your source code and visualizations.
Hint: We do not require a specific format (e.g., single-column or double-column, font size, and line spacing) for the final report. However, you need to make sure it is neat and readable. Some good and highly recommended templates (Word, LaTeX) can be found from IEEE Transactions or ACM conferences.