Found and liberated 2,151 missing DHS files

On January 18, 2017 the US Department of Homeland Security discontinued its Daily Open Source Infrastructure Report service which it had run since October 2006. To enable researchers to study the content of these reports, I collected as many as I could find (2,151 PDF files) and released them to the Internet Archive. You can find them here: DHS Daily Open Source Infrastructure Reports 2006-2017

The PDF files came from the following URLs:

  • https://www.dhs.gov/sites/default/files/publications/
  • https://www.dhs.gov/sites/default/files/publications/nppd/ip/daily-report/
  • https://www.dhs.gov/xlibrary/assets/

When these yielded 404 errors (which they did for most pre-2013 files), I used the Internet Archive itself, with the following URL base:

http://web.archive.org/web/20061101153326/https://www.dhs.gov/xlibrary/assets/[filename]
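For anyone who wants to repeat the collection, the lookup order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was actually used: the function names are made up, and treating any URL error as "try the next location" is a simplification.

```python
import urllib.error
import urllib.request

# Live locations listed above, tried in order.
LIVE_BASES = [
    "https://www.dhs.gov/sites/default/files/publications/",
    "https://www.dhs.gov/sites/default/files/publications/nppd/ip/daily-report/",
    "https://www.dhs.gov/xlibrary/assets/",
]

# Internet Archive fallback base quoted above.
WAYBACK_BASE = (
    "http://web.archive.org/web/20061101153326/"
    "https://www.dhs.gov/xlibrary/assets/"
)


def candidate_urls(filename):
    """Return every URL worth trying for one report, live sites first."""
    return [base + filename for base in LIVE_BASES] + [WAYBACK_BASE + filename]


def fetch_first_available(filename):
    """Return (url, bytes) for the first URL that responds, or None.

    Hypothetical helper: a 404 or any other URL error moves on to the next
    candidate.
    """
    for url in candidate_urls(filename):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                return url, response.read()
        except urllib.error.URLError:
            continue
    return None
```

Calling candidate_urls("DHS_Daily_Report_2006-10-11.pdf") gives four URLs, with the Internet Archive one last.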

Files are named as they were upon download, in one of the following patterns:

  • DHS_Daily_Report_2006-10-11.pdf (most 2006-2012 files have this format)
  • DHS-Daily-Report-2012-12-06.pdf (a single December 2012 file has this format)
  • dhs-daily-report-2013-01-09 (most 2013-2017 files have this format)

Some dates are missing from the collection: Archive.org lacked some files, and a few files were corrupted. If you are interested in those dates, this blog might be able to help fill in the gaps.
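If you download the collection and want to see which dates you are missing, a small sketch like this one may help. It pulls the date out of any of the three filename patterns above and lists the weekdays in a range that have no file. The assumption that reports appeared only on weekdays is mine, and federal holidays are not handled, so expect some false alarms.

```python
import re
from datetime import date, timedelta

# Matches all three naming patterns: underscores or hyphens, any letter case,
# with or without a .pdf extension.
DATE_IN_NAME = re.compile(
    r"dhs[-_]daily[-_]report[-_](\d{4})-(\d{2})-(\d{2})", re.IGNORECASE
)


def report_date(filename):
    """Return the date embedded in a report filename, or None if absent."""
    match = DATE_IN_NAME.search(filename)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    return date(year, month, day)


def missing_weekdays(filenames, start, end):
    """List weekdays from start to end (inclusive) that have no report file.

    Assumption: one report per weekday. Holidays are not excluded.
    """
    have = {report_date(name) for name in filenames}
    missing = []
    day = start
    while day <= end:
        if day.weekday() < 5 and day not in have:
            missing.append(day)
        day += timedelta(days=1)
    return missing
```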

 

Linguistic analysis of a local neo-confederate group’s Facebook posts

Earlier I showed how to extract the postings from a given Facebook page. Here, I will show you how to do some basic text mining on the posts you found. For practice, I will use the messages of a local neo-Confederate group called ACTBAC (“Alamance County Taking Back Alamance County”). Their antics have been covered in local media, but with their re-branding in light of the Trump election and the rise of the alt-right, many people in our area are still wondering just what this group is all about. Perhaps text mining can help illuminate some of their beliefs and strategies for us.

Overview

I ran the script on their “ALAMANCEOURS” Facebook page, and it yielded 1017 messages beginning in June, 2015. Here is the spreadsheet (actbac.csv) in case anyone wants to play around with it.

Top 50 words most used in their FB posts

I wrote a program to count frequencies and remove stopwords (stopwords are boring words like ‘a’, ‘to’, ‘it’, ‘is’). Then I highlighted the most interesting words (to me) in yellow. Each word is shown with its count next to it.

From these, we can see many predictable words for a county-based neo-Confederate group (county, state, southern, cause, carolina). However, I was most intrigued by the prominence of the word ‘stand’.
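For readers who want to try the counting step on the spreadsheet themselves, here is a minimal sketch of a stopword-filtered frequency count. It is not the program used for the results above. The tiny stopword list and the function name are placeholders, and a real run would use a much longer stopword list.

```python
import re
from collections import Counter

# A deliberately tiny stopword list for illustration only.
STOPWORDS = {
    "a", "an", "and", "the", "to", "it", "is", "of", "in",
    "for", "we", "us", "with", "our", "this", "that",
}


def top_words(messages, n=50, stopwords=STOPWORDS):
    """Return the n most common non-stopword words as (word, count) pairs."""
    counts = Counter()
    for message in messages:
        # Lowercase, then keep alphabetic words (allowing one inner apostrophe).
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", message.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts.most_common(n)
```

With the messages loaded into a list of strings, top_words(messages) gives the top 50 words along with their counts.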

Usage of the word ‘stand’

Stand can be both a noun (“take a stand”) and a verb (“stand up for yourself”). With this group, ‘stand’ is the most common verb used in their messages (not counting stopwords like ‘be’ or ‘is’). My hypothesis is that, as a verb, this word ‘stand’ conveys a lot of the power of their movement. Why?

To help understand how they use ‘stand’, I wrote a program to generate a concordance to show how the word is used in their messages. The first few lines of the concordance look like this:

The word of interest (shown in red) is placed in the center of each line. The concordance then shows each collection of words around that word.

From this, I learned that the word ‘stand’ is used 291 times in 1017 messages, most commonly as follows:

In addition, there are another 41 uses of “stood” and 86 uses of “standing”.
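A keyword-in-context concordance of this kind can be sketched in a few lines. This is an illustrative version, not the program that produced the numbers above. The function name and the default width of 30 characters on each side are arbitrary choices.

```python
import re


def concordance(messages, keyword, width=30):
    """Return one fixed-width line per occurrence, keyword in the center."""
    # \b keeps 'stand' from also matching 'standing' or 'understand'.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for message in messages:
        text = " ".join(message.split())  # collapse newlines and extra spaces
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Pad both sides so the keyword lines up in the same column.
            lines.append(f"{left:>{width}}{match.group(0)}{right:<{width}}")
    return lines
```

Because the match is on whole words, "stood" and "standing" need their own calls, for example concordance(messages, "standing").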

It would be interesting to compare this usage to other Confederate and non-Confederate groups to see whether this is a uniquely ACTBAC thing (I doubt it), or – more likely – it is a rhetorical device used more broadly by all Confederate groups. I would guess that their defensive “stand up for your beliefs, no matter how unpopular” plea has great power in a neo-Confederate setting. After all, the “Lost Cause” narrative also describes a heroic, virtuous South fighting against all odds, and ultimately unfairly defeated in the American Civil War.

Topic modeling

Next, just for fun, I wrote a program to build a topic model of the postings. A topic model tells us what words frequently co-occur in sentences, and tries to make groupings of those words into possible “topics”. Inside the program, you can fiddle with the number of topics, and the number of words generated for each.
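To get a feel for the raw signal a topic model works from, here is a much simpler stand-in: counting which pairs of words show up together in the same message. To be clear, this is not a topic model and not the program described here. It only illustrates the co-occurrence idea, and the function name is made up.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrences(messages, stopwords=frozenset(), top=10):
    """Count how many messages each pair of distinct words appears in together."""
    pairs = Counter()
    for message in messages:
        words = set(re.findall(r"[a-z]+", message.lower())) - set(stopwords)
        # Sort so each pair is always stored in the same (alphabetical) order.
        pairs.update(combinations(sorted(words), 2))
    return pairs.most_common(top)
```

A real topic model goes further by grouping words into a fixed number of topics, which is where the tunable numbers of topics and words come in.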

After running a few experiments, I settled on 3 topics with 4 words each. These topics weren’t terribly interesting, as you can see below, but we can still learn a few interesting things. First, when ‘stand’ is mentioned, it is often used with ‘southern’ and ‘state’, and it seems to be ‘people’ who are doing the standing (makes sense). Additionally, the topic we could call ‘Confederate battle flag’ emerges (labeled Topic 3 below):

Text Difficulty

Finally, I looked at how difficult the text was to read. These are fairly simple analyses based on sentence structure, number of “difficult” words, and how many syllables are in the words.

The FKRE is the Flesch-Kincaid Reading Ease metric, which tells you how “easy” a document is to read, and then this number (71.55, or “fairly easy”) can be converted to a grade level metric (7th grade). I also ran an overall readability summary, which integrates several other difficulty measures in addition to FKRE. That one also puts this text at right around 6th or 7th grade.
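For the curious, the standard Flesch Reading Ease formula is 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word), and scores between 70 and 80 are conventionally read as "fairly easy", or about 7th grade. Here is a rough sketch of it. The syllable counter is a crude vowel-group heuristic of my own choosing, so its scores will differ somewhat from those of a proper readability library.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting vowel groups (a rough heuristic)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # treat a trailing 'e' as silent
    return max(count, 1)


def flesch_reading_ease(text):
    """Compute the Flesch Reading Ease score for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
```

Higher scores mean easier text, and very short, simple sentences can score above 100.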

I hope you enjoyed this quick tour of text mining – perhaps you will find some interesting techniques to use on your own projects!

Analysis of the latest 1000 Facebook posts by the Times-News (Aug-Dec)

I was playing around with some code today from Mastering Social Media Mining with Python (by Marco Bonzanini, and published by the same company that published my last two books), and I came up with this snazzy set of scripts (postGetter.py, fileParser.py) that mines the last X posts from any public Facebook page, creates a clickable FB url for each, sorts them in order of most interactions (shares + likes), and creates a spreadsheet with the results.

Here are the results when run for the last 1000 posts by the Times-News of Burlington, our local newspaper: timesNews.csv.
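The ranking and spreadsheet steps are simple enough to sketch here. This is not the code in postGetter.py or fileParser.py. It assumes each post has already been fetched into a dictionary with id, message, shares, and likes keys, and that putting the post id after facebook.com gives a working link. All of those are assumptions for illustration.

```python
import csv


def rank_posts(posts):
    """Add an interaction total and a URL to each post, busiest first."""
    ranked = []
    for post in posts:
        row = dict(post)
        row["interactions"] = post.get("shares", 0) + post.get("likes", 0)
        # Assumed URL shape, not confirmed against Facebook's documentation.
        row["url"] = "https://www.facebook.com/" + post["id"]
        ranked.append(row)
    ranked.sort(key=lambda row: row["interactions"], reverse=True)
    return ranked


def write_spreadsheet(rows, path):
    """Write the ranked posts to a CSV file that opens in any spreadsheet app."""
    fields = ["url", "interactions", "shares", "likes", "message"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_spreadsheet(rank_posts(posts), "ranked.csv") would then produce a file sorted by most interactions.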

Findings?

Not that surprising or shocking, but here goes. The last 1000 only goes back to August or so (modify the params at the top of the code to make it scrape more), but the top five posts for August-December based on interactions seem to be:

  1. The death of Tim-Bob from Graham Cinema
  2. The abduction of a middle schooler from a bus stop
  3. Kmart closing
  4. 25-minute Christmas Lights show on Maple Ridge Dr.
  5. Housing emergency at Burlington Animal Services

No election-related or weather-related items cracked the top 20.

The NC GOP & Pointy-headed Professors

The other day on This American Life, Dallas Woodhouse, executive director of the North Carolina GOP, dismissed evidence that vote fraud in the state was basically non-existent:

Don’t show me studies. Academics, I mean, a bunch of knuckleheads, pointy-headed professors. We deal in the real world.

Since I’ve done prior academic work on insults, I was very intrigued at the possibility of my being simultaneously pointy-headed and knuckle-headed, living in an unreality where I and my cranially-challenged colleagues churn out reams of useless studies in order to retain Total World Domination.

As it turns out, the origin of knucklehead was a U.S. Army PR/recruitment program’s Goofus-type character (like the old Highlights magazine “Goofus and Gallant”) named R.F. Knucklehead. He was never portrayed as smart, and was always making bad decisions. Here is a cartoon showing Aviation Cadet Knucklehead working hard at signing a simple signature:

Incidentally, knucklehead was also the word chosen by President Obama to describe the prostitute-hiring Secret Service agent shenanigans in Cartagena.

The origin of pointy-headed was George Wallace in 1968 (good company you’re keeping there, Dallas). The word is a play on the shape of an egg, as in egghead. The Washington Post explains the Wallace usage:

He sneered from the campaign podium at the “long-haired men and short-skirted women” of the 1960s and derided “pointy-head college professors who can’t even park a bicycle straight.”

I wonder what happened in the bicycle parking lot between Wallace and some unlucky academic. We’ll never know. But The New York Times brings back the “pointy-head” quote for our times, comparing Wallace’s use of anti-intellectual populist insults to Trump’s.

Chronicle of Higher Ed ran an article back in 2012 listing additional stereotypes that politicians use to describe the hated professor. The article includes a really nice egghead pun from Adlai Stevenson (“Eggheads of the world, unite! You have nothing to lose but your yolks.”), who was often criticized for being one himself.

Interestingly, Google users seem to think differently about professors (for those wondering if these results were influenced by my login, they weren’t: this was an incognito browser window).

Anyway, I hope this little etymological excursion shows what a professor does when she hears something idiotic: we brush aside the insult and instead we ask lots of questions, look up the answers, synthesize the results into a conclusion, maybe ask additional questions, cite our sources, then teach what we learned to others.