
Down with the static

After six years of static HTML, it finally became apparent that this site needed a real CMS. Please excuse the construction while we settle into our new home.

Posted in Company News

Generating AWS CloudSearch SDF for Emails

  In my last post on CloudSearch and eDiscovery, I described something like “Google” for eDiscovery emails.  FedEx or Dropbox your data to an eDiscovery service provider like me, and rest assured that you’ll soon have a powerful, web-based user interface for searching and visualizing your digital discovery materials.

  As a technical follow-up to that post, I thought I’d share a proof-of-concept email parser based on the Enron email dataset.  The Python script below takes a directory of RFC822 email messages and emits an AWS CloudSearch JSON SDF batch built from the Date, From, To, and Subject headers and the body of each email.  There is no special handling for attachments or encoding in this example, but it can be used to populate a CloudSearch domain from the Enron emails.  Sample usage is below; an output sample is available here.

$ python src/generateSDF.py "data/maildir/allen-p/inbox/*" | curl -X POST -d "@-" --header "Content-Type: application/json" doc-domain_name-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/documents/batch
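For readers who want a feel for the approach before the break, here is a minimal sketch of such a parser. It is not the generateSDF.py script itself. The function names, the email_* field names, the MD5-based document ID, and the latin-1 file encoding are all assumptions made for illustration, and the add/id/version/lang/fields layout follows my understanding of the 2011-02-01 SDF format.

```python
import email
import glob
import hashlib
import json


def message_to_sdf(raw_text, path, version=1):
    """Convert one RFC822 message into a single SDF "add" operation (a dict)."""
    msg = email.message_from_string(raw_text)
    payload = msg.get_payload()
    # No attachment handling: multipart payloads are simply skipped.
    body = payload if isinstance(payload, str) else ""
    # Assumption: hash the path to get a lowercase alphanumeric document ID.
    doc_id = hashlib.md5(path.encode("utf-8")).hexdigest()
    return {
        "type": "add",
        "id": doc_id,
        "version": version,
        "lang": "en",
        "fields": {
            # Hypothetical field names; they must match the domain's index fields.
            "email_date": msg.get("Date", ""),
            "email_from": msg.get("From", ""),
            "email_to": msg.get("To", ""),
            "email_subject": msg.get("Subject", ""),
            "email_body": body,
        },
    }


def build_batch(pattern):
    """Parse every file matching the glob pattern into a list of SDF operations."""
    batch = []
    for path in sorted(glob.glob(pattern)):
        # Assumption: latin-1 never fails to decode, so no message is dropped.
        with open(path, encoding="latin-1") as handle:
            batch.append(message_to_sdf(handle.read(), path))
    return batch
```

Something like json.dump(build_batch("data/maildir/allen-p/inbox/*"), sys.stdout) would then write a batch that can be piped to curl as in the usage line above.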

Source code below the break.


Posted in Programming, Research

“Google” for subpoenaed emails: AWS CloudSearch for eDiscovery

  In the last post on AWS CloudSearch, I provided a tutorial on the creation of a simple CloudSearch domain for Supreme Court decisions.  This walkthrough described the steps of creating a domain, configuring access policies and indexing, populating the index, and using the search API.  We were left with a functioning case search database.

  From a technical perspective, one key difference between this example and many real-world applications is that we let the CloudSearch tools automatically decide what fields and content were available to search.  While this worked well in the previous example, I want to provide a concrete example of a context in which custom services and development are required.

  Imagine you’re a smaller law firm that specializes in HR disputes.  As part of a time-sensitive non-solicitation claim filed by your client, you’ve subpoenaed email from fifteen employees at your client’s competitor.  It’s Friday afternoon at 5 PM, and you finally receive a hard drive with the emails.  However, in an effort to overwhelm your small team, the other party has dumped 10GB of data on your plate.  There’s no way you can search through this by hand.  You have a hearing on Wednesday, but need to prepare a strategy memo for your client by Monday morning.  Do you disappoint your client and move to reschedule?  How could you possibly make the deadline?  If only you could just press a button and get something like Google for your data…

  Combined with the right service provider (like Bommarito Consulting!), AWS CloudSearch is a perfect solution for this problem.  Before CloudSearch, eDiscovery services were constrained by whatever on-site infrastructure a provider already had.  eDiscovery service providers had to make large capital expenditures on servers and storage to meet peak customer needs, inflating the price paid by other customers.  Even if eDiscovery service providers were leveraging Infrastructure-as-a-Service (IaaS) providers like AWS EC2, there was still a significant amount of operations overhead required to manage variable customer demand.

  CloudSearch makes these problems disappear.  In our example above, building a “Google” for your subpoenaed emails can be done in just hours.  The core components are an RFC822 parser to populate the search domain and a front-end user interface for searching and visualizing the results.  If this service sounds valuable to your business, today or just prospectively, please feel free to call or email regarding a demo or additional information.

Posted in Law, Programming, Technology

Building an AWS CloudSearch domain for the Supreme Court

  It should be pretty clear by now that two things I’m very interested in are cloud computing and legal informatics.  What better way to show it than to put together a simple AWS CloudSearch tutorial using Supreme Court decisions as the context?  The steps below should take you through creating a fully functional search domain on AWS CloudSearch for Supreme Court decisions.

Acquiring Supreme Court decision data

  Our first step is to acquire a public domain copy of Supreme Court decisions from Carl Malamud’s resource.org.  You can navigate to this directory and download US.tar.bz2, or just run something like:

$ wget http://bulk.resource.org/courts.gov/c/US.tar.bz2

Once the download is done, extract the archive:

$ tar xjf US.tar.bz2

  We should now have a directory called US with 1.1GB and 62,839 files.  Let’s assume that you put this directory under something like /data/courts/US.
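If you want to confirm that the extraction matches those numbers, a short sketch like the following walks the tree and totals the files and bytes. The function name summarize_tree is made up for this example, and /data/courts/US is just the path assumed above.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for every regular file under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes
```

Calling summarize_tree("/data/courts/US") should report a file count and size close to the figures above if the archive extracted cleanly.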

Setting up Cloud Search command line tools

  The next step is easy – go follow my guide on setting up Cloud Search command line tools!  I’ll assume that you placed everything under /opt/aws/cloud-search-tools, just like in that post.

Creating a Cloud Search Domain

  OK, we should now have a dataset and the Cloud Search API at our fingertips.  It’s time to create a Cloud Search “domain” that we can populate with records.  To do so, you can either follow the instructions on your AWS Management Console or run the following:

$ /opt/aws/cloud-search-tools/bin/cs-create-domain -d scotus

  This may take a while to create; sometimes up to 15 minutes.  Go grab a coffee or a beer and read your feed while you wait.  You can check the status either through the Management Console in your browser or with the following line:

$ /opt/aws/cloud-search-tools/bin/cs-describe-domain -d scotus

  Once this step is complete, you should see an ACTIVE domain with 0 documents.  We now need to reconfigure the access policies so that the domain allows us to submit documents from our own IP address and allows anyone to search:

$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow IP_ADDRESS --service doc
$ /opt/aws/cloud-search-tools/bin/cs-configure-access-policies -d scotus --update --allow all --service search

This policy change may take a few minutes to go into effect.

Lastly, we need to tell the domain what we are indexing per document.

$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name title --type text --option result
$ /opt/aws/cloud-search-tools/bin/cs-configure-fields -d scotus --name content --type text --option result

Populating the Cloud Search Domain

  OK, we’re ready to go!  At this point, we need to generate Search Data Format (SDF) files to populate the domain.  There are two approaches we can take:

  1. Write a parser to extract exactly the text content and metadata we want.
  2. Throw the pre-packaged cs-generate-sdf utility at our data and hope for the best.

  For brevity’s sake, we’ll pursue option 2.  After some poking around, I’ve found that cs-generate-sdf is based on a common open-source content extraction library – Apache Tika.  You might be familiar with Tika, as it’s the guts behind Solr’s ability to ingest unstructured data.  So if you’d be happy naively ingesting the content in Solr, you’ll probably be happy with the results that cs-generate-sdf produces.

  While we could build something more complex, let’s stick to bash here:

$ for d in `find /data/courts/US/ -type d`;
do
  /opt/aws/cloud-search-tools/bin/cs-generate-sdf --source "$d/*.html" -d scotus;
done

  A few things to note:

  • If you see error messages like “Request forbidden by administrative rules” or “403 Forbidden”, your access policies have not taken effect or you provided the wrong IP for the document service.
  • You should see lots of lines go by; two for every file that is being parsed.
  • This step can be parallelized, but will almost certainly be disk-bound unless you are running on some kind of RAID or NAS setup that allows for concurrent reads.
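  If you would rather drive this step from Python, the loop above can be sketched roughly as follows.  It only uses the --source and -d flags shown in the bash version; the function names are made up for this example, and it skips directories that contain no .html files, which the bash loop does not do.

```python
import os
import subprocess

# Install path assumed earlier in this walkthrough.
CS_GENERATE_SDF = "/opt/aws/cloud-search-tools/bin/cs-generate-sdf"


def html_directories(root):
    """Yield each directory under root that directly contains .html files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(name.endswith(".html") for name in filenames):
            yield dirpath


def build_commands(root, domain="scotus"):
    """Build one cs-generate-sdf argument list per directory of HTML files."""
    return [
        # The glob is passed through unexpanded, like the quoted "$d/*.html".
        [CS_GENERATE_SDF, "--source", os.path.join(d, "*.html"), "-d", domain]
        for d in sorted(html_directories(root))
    ]


def run_all(root, domain="scotus"):
    """Run the commands one at a time, stopping at the first failure."""
    for command in build_commands(root, domain):
        subprocess.run(command, check=True)
```

  Keeping command construction separate from execution makes it easy to print the commands for a dry run before sending anything to the domain.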

  This could take a while; about 45 minutes to generate and transmit on my i7 2600k/32GB RAM/SATA III SSD workstation.  You should grab another coffee or beer and watch a show.

  Another caveat: even after you’ve transmitted all data up to the cloud, it will still take some time for the Cloud Search instance to churn through the data and complete indexing.

Searching the Cloud Search Domain

  Once the Cloud Search instance is fully built, it’s time to figure out how to search.  The best way to do this is, sadly, to read the developer documentation.  However, if you want to skip the boring parts, just try running something like this:

$ curl 'http://search-scotus-domain_id.us-east-1.cloudsearch.amazonaws.com/2011-02-01/search?q="clear%20and%20present%20danger"&return-fields=title'

  This search looks for an exact phrase match on “clear and present danger” and returns not only the document ID, but also the title property of the document.  You should get back something like this:

{"rank":"-text_relevance","match-expr":"(label '"clear and present danger"')","hits":{"found":100,"start":0,"hit":[{"id":"d__data_courts_us_395_395_us_444_492_html","data":{"title":["395 U.S. 444"]}},{"id":"d__data_courts_us_343_343_us_946_326_html","data":{"title":["343 U.S. 946"]}},{"id":"d__data_courts_us_341_341_us_494_336_html","data":{"title":["341 U.S. 494"]}},{"id":"d__data_courts_us_370_370_us_375_369_html","data":{"title":["370 U.S. 375"]}},{"id":"d__data_courts_us_435_435_us_829_76_1450_html","data":{"title":["435 U.S. 829"]}},{"id":"d__data_courts_us_328_328_us_331_473_html","data":{"title":["328 U.S. 331"]}},{"id":"d__data_courts_us_360_360_us_924_488_html","data":{"title":["360 U.S. 924"]}},{"id":"d__data_courts_us_414_414_us_890_72_6629_html","data":{"title":["414 U.S. 890"]}},{"id":"d__data_courts_us_295_295_us_441_665_html","data":{"title":["295 U.S. 441"]}},{"id":"d__data_courts_us_331_331_us_367_241_html","data":{"title":["331 U.S. 367"]}}]},"info":{"rid":"90c9b0fdba3e834bd8a0834c12371bbbcbe700391fa33547ff19c86ee8af36004f16216852072604","time-ms":5,"cpu-time-ms":0}}
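  The same query can be built and its response unpacked from Python.  This is a rough sketch: the function names are made up, and the endpoint you pass in is whatever your own domain reports, since domain_id above is a placeholder.  It relies only on the q and return-fields parameters and on the hits/hit/id/data/title keys visible in the response above.

```python
import json
from urllib.parse import quote, urlencode


def build_search_url(endpoint, phrase, return_fields=("title",)):
    """Build an exact-phrase search URL for the 2011-02-01 search API."""
    params = {
        # Wrapping the phrase in double quotes asks for an exact phrase match.
        "q": '"%s"' % phrase,
        "return-fields": ",".join(return_fields),
    }
    # quote_via=quote encodes spaces as %20, as in the curl example.
    query = urlencode(params, quote_via=quote)
    return "http://%s/2011-02-01/search?%s" % (endpoint, query)


def extract_hits(response_text):
    """Return a list of (document id, list of titles) from a search response."""
    response = json.loads(response_text)
    return [
        (hit["id"], hit.get("data", {}).get("title", []))
        for hit in response["hits"]["hit"]
    ]
```

  One caution: as printed above, the match-expr value contains unescaped inner double quotes, so pasting that text directly into json.loads will fail; the inner quotes need to be backslash-escaped before the text parses as JSON.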

  So, there it is! Your own fully searchable AWS Cloud Search domain for the Supreme Court. Not so bad after all, was it?

Posted in Law, Programming, Research

Visualization of Reading Level Frequency by Congressional Bill Stage

  Here’s a fun example of how you might use my data on Congressional bill length and complexity.  Imagine you want to understand the empirical distribution of Flesch-Kincaid reading level for Congressional bills and how this distribution is related to bill stage.  A first step might be to visualize this relationship.

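  Based on this visualization, you might infer that engrossed bills tend to have less right-skew and a lower mean reading level.  The story behind this might be that Senators and Representatives are less likely to accept legislation they do not understand.  To test this, you might run a simple Kolmogorov-Smirnov (KS) test to see if the introduced bill reading levels are greater than engrossed bill reading levels.  In R’s ks.test, alternative="less" is the hypothesis that the CDF of the first sample lies below that of the second, which corresponds to the first sample tending to take larger values.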

> ks.test(introduced, engrossed, alternative="less")

	Two-sample Kolmogorov-Smirnov test

data:  introduced and engrossed
D^- = 0.094, p-value = 0.006299
alternative hypothesis: the CDF of x lies below that of y
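
To make the D^- statistic concrete, here is a small pure-Python illustration, separate from the R source. It computes the largest amount by which the empirical CDF of y exceeds that of x, along with the usual asymptotic one-sided p-value approximation exp(-2 m D^2), where m = n_x n_y / (n_x + n_y). The function names are made up, and I am assuming this approximation is comparable to what R reports rather than guaranteeing identical numbers.

```python
import math
from bisect import bisect_right


def ks_less_statistic(x, y):
    """Return D^- = max over t of (F_y(t) - F_x(t)) for two samples."""
    xs, ys = sorted(x), sorted(y)
    best = 0.0
    # The empirical CDFs only change at observed values, so check each one.
    for t in xs + ys:
        f_x = bisect_right(xs, t) / len(xs)
        f_y = bisect_right(ys, t) / len(ys)
        best = max(best, f_y - f_x)
    return best


def ks_one_sided_pvalue(d, n_x, n_y):
    """Asymptotic one-sided p-value approximation, exp(-2 * m * d^2)."""
    m = n_x * n_y / (n_x + n_y)
    return math.exp(-2.0 * m * d * d)
```

With toy inputs x = [4, 5, 6] and y = [1, 2, 3], every value of x exceeds every value of y, so the statistic reaches its maximum of 1.0; swapping the two samples gives 0.0.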

Sample source below.

Posted in Law, Programming, Research